Rassin' Frassin' VM Issue

Started by deanwebb, June 16, 2015, 06:07:02 PM

deanwebb

I have three virtual network appliances in one of our data centers. I had two virtual NICs set up for each one. QIP had a MAC address and a corresponding IP address that could be reserved for each virtual NIC. So, I had 10.0.0.1 through .6 set aside and assigned to my virtual network appliances. (IP addresses changed to protect my firm's internal addressing scheme)

That was two months ago.

Today, I tried to SSH to one of the devices and got some really funky intermittent connectivity. I mentioned this to a manager, who told me that he'd look into the issue after he resolved another issue in the data center... well, it turns out that the issues are connected.

See, we changed staffing partners for our VM environment a month and a half ago, and apparently those guys didn't pay full attention to the whole "Don't assign MAC addresses and IP addresses that have already been assigned" instruction. We weren't using the second IP address on those devices as yet, so those all got assigned to Windows Server VMs. But the main address for the device I was trying to connect to, 10.0.0.5, was assigned to another Windows Server VM. That box had the same MAC address, as well.

:developers:

So, tomorrow, I get to get on the phone and holler at the guys responsible for stealing my four IP addresses. I got there first, I have to have the addresses be consecutive, so those four servers have to get new IP addresses. Too bad so sad.

... unless those are mission-critical Windows servers that can't easily change their IP addresses. Then the secondary IP addresses will be lost and the primary address on the third appliance will be my last-ditch stand.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Reggle

I hate it when that happens. It's just proof of not being professional. When you need an IP in a network you don't own, don't assume.

SimonV

Isn't the VM's MAC address automatically derived from the IP address? Never heard of handing out MAC addresses.

deanwebb

Quote from: SimonV on June 17, 2015, 04:20:11 PM
Isn't the VM's MAC address automatically derived from the IP address? Never heard of handing out MAC addresses.
The available MAC addresses and the IP addresses pre-assigned to said MAC addresses were all in a table for checkout. Not sure how that all went, but the VM operator actually edits the MAC address of the VM and, hey presto!, the VM now has the appropriate IP address.
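
For what it's worth, the kind of check that would have caught the double assignment can be sketched in a few lines. This is only a rough illustration, not how QIP or the VM platform works: the `find_conflicts` function, the owner names, and the MAC values are all made up, and it assumes the checkout table can be exported as (owner, MAC, IP) rows.

```python
from collections import defaultdict


def find_conflicts(assignments):
    """Return IPs and MACs that are claimed by more than one owner.

    assignments: iterable of (owner, mac, ip) tuples, e.g. exported
    from the reservation table (hypothetical format).
    """
    by_ip = defaultdict(set)
    by_mac = defaultdict(set)
    for owner, mac, ip in assignments:
        by_ip[ip].add(owner)
        # Normalize case so "AA:BB" and "aa:bb" count as the same MAC.
        by_mac[mac.lower()].add(owner)
    ip_conflicts = {ip: sorted(o) for ip, o in by_ip.items() if len(o) > 1}
    mac_conflicts = {m: sorted(o) for m, o in by_mac.items() if len(o) > 1}
    return ip_conflicts, mac_conflicts


# Hypothetical rows: the appliance and a Windows VM both hold 10.0.0.5.
assignments = [
    ("appliance-1", "00:50:56:00:00:01", "10.0.0.1"),
    ("appliance-3", "00:50:56:00:00:05", "10.0.0.5"),
    ("win-server-7", "00:50:56:00:00:05", "10.0.0.5"),
]
ip_conflicts, mac_conflicts = find_conflicts(assignments)
print(ip_conflicts)
print(mac_conflicts)
```

Running something like this before each checkout would flag a second owner on the same MAC or IP before the VM is ever powered on.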

deanwebb

New issue: oversubscription in one VM environment caused some issues with the virtual network appliances that were functioning as RADIUS servers, with hilarious results, for values of hilarious that include "RADIUS timeouts not allowing users on to the wireless network".

Was this something we warned about going into the NAC project? Yes. Was this considered when purchasing decided the all-virtual solution would be better, money-wise? Not really. We had the assumption that we'd get dedicated resources, but that was not delivered. Now we're payin' the piper...

In other VM news, the guy running Cisco Prime and MSE was wondering aloud why he's not able to reliably connect to his virtual devices anymore... more cases of VM guys not having a good procedure for IP address management.

Lesson learned, and it's a big one: if your firm wants to go with a 100% virtual solution (or as much virtual as possible), then those devices have to be treated specially. Network devices are more than just Windows file servers, and the network will suffer gravely if there's any congestion in the processors, hard drives, memory, or - gasp - backplane. Price in that dedicated virtual environment when comparing the cost of virtual to physical, or suffer later on when nobody on the project has the cash to get what is needed.

mmcgurty

Quote from: deanwebb on July 01, 2015, 06:21:40 PM

This is a really good topic. I am not a VMware guy at all, so I can't speak to whether or not this is true. We were having issues with a large deployment of Cisco Prime Infrastructure for Wireless. We had a guy on our team who was literally spending 20-30 hours per week with Cisco TAC and the Cisco Business Unit for both Prime and Wireless trying to figure out why the platform wasn't stable and the database was corrupting. We came to find out that when it was built, the VM guys didn't provision it with dedicated resources, and it wasn't getting enough resources to do some of its functions. This resulted in poor performance for end users working inside of Prime, as well as database issues (presumably from not having enough CPU/memory to write things to the DB).

We were told that if we wanted dedicated resources for this VM, we had to purchase our own blade (or blades, if we wanted redundancy), because they cannot provision dedicated resources for VMs in the existing cluster without causing issues for all the other VMs. Now, this would be fine if we all had $40K-$80K every time we wanted to do this, but they never even mentioned it until we started having issues, and that money has already been spent elsewhere.

We also recently wanted to deploy an F5 virtual load balancer pair to support an emergency infrastructure. We got the same song and dance about needing our own blades to run it on dedicated resources. I really don't understand why you can't provision some VMs with dedicated resources in the same cluster as VMs that don't require them. I think someone doesn't know what they are doing, but I don't know enough about it to call them out on it.

Nerm

Quote from: mmcgurty on July 02, 2015, 07:33:40 AM

Sounds to me like the host systems are oversubscribed, which is why they cannot dedicate resources. If you have 32 cores and 32 VMs that each need 1 core, but one of them needs, say, 2 cores, then dedicating those cores affects all of the others. They are probably just giving you the song and dance because they don't want to admit they are oversubscribing their host systems.
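
To put rough numbers on that, here is a quick back-of-the-envelope sketch. It reads the example as 31 single-vCPU VMs plus one 2-vCPU VM on a 32-core host, which is my interpretation of the figures above, and the `cpu_headroom` function is just a made-up helper, not anything from a hypervisor API.

```python
def cpu_headroom(physical_cores, vm_vcpus, reserved_vcpus=0):
    """Summarize CPU oversubscription on a single host.

    physical_cores: cores on the host
    vm_vcpus:       list of vCPU counts, one entry per VM
    reserved_vcpus: vCPUs pinned to one VM and unavailable to the rest

    Returns (overall vCPU:core ratio,
             cores left for the shared pool,
             vCPUs competing for that shared pool).
    """
    total_vcpus = sum(vm_vcpus)
    ratio = total_vcpus / physical_cores
    shared_cores = physical_cores - reserved_vcpus
    shared_demand = total_vcpus - reserved_vcpus
    return ratio, shared_cores, shared_demand


# Assumed reading of the example: 31 VMs at 1 vCPU, one VM at 2 vCPUs.
vms = [1] * 31 + [2]
ratio, shared_cores, shared_demand = cpu_headroom(32, vms, reserved_vcpus=2)
print(f"vCPU:core ratio {ratio:.3f}")
print(f"{shared_demand} vCPUs left competing for {shared_cores} cores")
```

Under those assumptions the host is already slightly past 1:1, and dedicating 2 cores leaves 31 single-vCPU VMs sharing 30 cores, so every other VM feels it.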

deanwebb

The VM hazard can be phrased as "a problem of a company of a certain size", to wit:

Small: Everyone knows that the VM platform is oversubscribed, but you can't afford hardware, so you lump it.

Medium: You know the VM guy, he gets a special environment just for you and networking, and it's great for a month... until they run out of room elsewhere and have to start spinning up VMs in your environment as an emergency measure that becomes permanent.

Large: The VM architect assures the Network architect that there is plenty of room, because that's the truth according to his nice, clean Visio document. The vendor sales guy says that the VM will do what the hardware does, because he wants to make that sale. Accounting and purchasing love not spending money, so going virtual is a massive savings and gets the green light from them. Because the assumption was made that there was plenty of quality room in the VM environment, not once does anyone think about purchasing blades or blocks for exclusivity and reliability. At which point, please see "Small" for the end result, substituting "the network guy and the VM guy" for "everyone."

***

As for dedicated resources, it may be entirely possible to do so, but only if the internal accounting charges are paid for.

But there's also the problem with the VM culture in a company. If the culture is all about fractional reserve computing, cramming in as many devices as possible to save bucks, then network gear is going to die hard, because it expects to be able to make a full withdrawal of resources whenever it needs them. When those resources aren't there, the Windows boxes can abide a while, but the network needs things to be real-time. If no more resources are allocated on the VM host than are actually available, the network devices are happy, but then the cost of the dedicated host plus the VM licenses may equal or exceed the cost of straight hardware.

I am feeling like I might be forced to learn about VMs not because I want to get into that technology, but because I want to get OUT of it...

icecream-guy

Enterprise: The VM guys manage several VM clusters in different distribution blocks across the enterprise. Some are oversubscribed and some are not, so they request the ability to move any VM to any of the clusters throughout the enterprise so they can balance loads. But nobody had any foresight back in the day, and the awesome ancient network guys isolated VLANs to the local distribution area rather than the enterprise. So, when the network team tries to make this work, they realize that some of the needed VLAN numbers are in use in other distribution areas and end up having to remap VLANs across the different distribution blocks, create L2 extensions between distribution blocks, and circumvent the firewalls to make it all work. Sheesh.
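
A rough sketch of the audit that comes before any remapping: find VLAN IDs that mean different things in different distribution blocks. The block names, VLAN numbers, and the `vlan_collisions` function are all invented for illustration, and it assumes each block's VLAN database can be dumped as an ID-to-purpose mapping.

```python
def vlan_collisions(blocks):
    """Find VLAN IDs used for different purposes in different blocks.

    blocks: dict of block name -> {vlan_id: purpose} (hypothetical export).
    Returns {vlan_id: {block: purpose}} for IDs whose purpose differs
    between at least two blocks.
    """
    usage = {}
    for block, vlans in blocks.items():
        for vlan_id, purpose in vlans.items():
            usage.setdefault(vlan_id, {})[block] = purpose
    # An ID only collides if the blocks disagree about what it is for.
    return {
        vlan_id: per_block
        for vlan_id, per_block in usage.items()
        if len(set(per_block.values())) > 1
    }


# Made-up example: VLAN 100 is reused locally, VLAN 200 is consistent.
blocks = {
    "dist-a": {100: "servers", 200: "voice"},
    "dist-b": {100: "printers", 200: "voice"},
}
collisions = vlan_collisions(blocks)
print(collisions)
```

Anything this turns up is a VLAN that has to be renumbered in at least one block before an L2 extension can carry it between them.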

:professorcat:

My Moral Fibers have been cut.