
vShield Zones – Some Serious Gotchas

OK, I’ll admit it: I am spoiled by the capabilities of vSphere. What other platform lets you schedule system updates that run unattended, without outages for the applications in use? I don’t mean the winders patches; those require a monthly reboot anyway. I am talking about the hypervisor updates. VMware Update Manager coordinates all of this for you. Then along comes vShield Zones to break it all.

First, let me explain what I am trying to do. To simplify things, vShield Zones is a firewall for vSphere virtual machines. Rather than regurgitate how it works, take a look at Rodney’s excellent post. A customer has decided to use vShield Zones to help with PCI compliance. The desire is to allow only certain VMs to communicate with certain other VMs over specific network ports, and to audit that traffic. ’nuff said.

vShield Zones seems to be the perfect solution for this. It works almost seamlessly with vCenter and the underlying ESXi hosts. It uses hardened Linux virtual appliances (vShield Agents) to do the firewalling, and it provides a fairly nice management interface to create the firewall rules and distribute them to the vShield Agents. Best of all, IT’S FREE! At least for vSphere Advanced and above. Keep in mind that this is still considered a 1.x release and some things need to be worked out.

Now, on to the gotchas.

Gotcha #1 – Networking

When it comes to networking, the vShield Agent is designed to sit between a vSwitch that is externally connected via physical NICs (pNICs) and a vSwitch that is isolated from the outside world. The vShield Agent installation wizard will prompt you to select a vSwitch to protect. This is illustrated below. The red line indicates network traffic flow.

This works like a champ in this configuration: one vSwitch for management (which is naturally on an isolated network to begin with), one vSwitch for the VMs to connect to the vShield Agent, and one vSwitch to connect everything to the outside world. This can also be deployed with limited downtime. If you are lucky enough to have the Enterprise Plus version, you may want to use a vNetwork Distributed Switch or even a Cisco Nexus 1000V. You will need to make some manual configuration changes to make this work, as outlined in the admin guide.

The gotcha is with blade servers or “pizza box” servers that have limited I/O slots. If all of the VM traffic must flow through the same physical NICs and you use a vSwitch, then you need the vShield Agent to protect a port group rather than an entire vSwitch. You will need to create a vSwitch with a protected port group and connect it to the pNICs. Then you can install the vShield Agent. Once the vShield Agent is installed, you will need to go back to the vSwitch attached to the pNICs and add an unprotected port group. This is illustrated below. The red line is the protected traffic and the blue line is the unprotected traffic.

As you can see, there is an unprotected port group (ORIGINAL Network). This needs to be added to the vSwitch AFTER the vShield Agent is installed. If the ORIGINAL Network is already a part of the vSwitch, it will need to be removed BEFORE installing the vShield Agent. In order to avoid an outage, you will need to disable DRS and manually vMotion all VMs off the ESX/ESXi host before installing the vShield Agent and modifying the port groups.
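For those who script, here is a minimal PowerCLI sketch of that sequence. All of the names (vcenter.example.com, ProdCluster, esx01/esx02, vSwitch1) are placeholders, and it glosses over the vShield Agent installation itself, which is driven from the vShield Manager:

Connect-VIServer vcenter.example.com

# Keep DRS from moving VMs back onto the host mid-change
Set-Cluster -Cluster "ProdCluster" -DrsAutomationLevel Manual -Confirm:$false

# Manually vMotion every running VM off the target host
$vmhost = Get-VMHost "esx01"
Get-VM -Location $vmhost | Where-Object { $_.PowerState -eq "PoweredOn" } |
    Move-VM -Destination (Get-VMHost "esx02")

# ... install the vShield Agent from the vShield Manager here ...

# AFTER the agent is in place, add the unprotected port group back
$vswitch = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch1"
New-VirtualPortGroup -VirtualSwitch $vswitch -Name "ORIGINAL Network"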

Gotcha #2 – DRS/HA Settings

The vShield Agents attach to isolated vSwitches with no pNIC connection. As you should already know, using DRS and vMotion with an isolated vSwitch can cause connectivity between VMs to fail. By default, you cannot vMotion a VM that is attached to an isolated vSwitch. You will need to enable this by editing the vpxd.cfg file. You will also need to disable HA and DRS for the vShield Agents so they stay on the hosts where they are installed. Both are well documented. Obviously, you will need to install a vShield Agent on every ESX/ESXi host in the cluster.
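For reference, the vpxd.cfg change is small. If memory serves, it is the VMOnVirtualIntranet compatibility test that has to be relaxed, roughly as below; double-check the vShield Zones admin guide before touching your own vCenter:

<!-- Excerpt of vpxd.cfg on the vCenter Server (restart the vCenter
     Server service after editing). Setting this test to "false" lets
     vCenter vMotion VMs that sit on vSwitches with no physical uplink. -->
<config>
  <migrate>
    <test>
      <CompatibleNetworks>
        <VMOnVirtualIntranet>false</VMOnVirtualIntranet>
      </CompatibleNetworks>
    </test>
  </migrate>
</config>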

The gotcha here is that, with HA disabled for the vShield Agent, there is no reliable facility for automatic startup. There is an automatic startup setting in the Startup/Shutdown section of the host configuration, but first, it is an all-or-nothing setting. Second, according to the Availability Guide:

“NOTE The Virtual Machine Startup and Shutdown (automatic startup) feature is disabled for all virtual machines residing on hosts that are in (or moved into) a VMware HA cluster. VMware recommends that you do not manually re-enable this setting for any of the virtual machines. Doing so could interfere with the actions of cluster features such as VMware HA or Fault Tolerance.”

So, if a host fails, HA will restart all protected VMs on different hosts. If the host comes back online, you risk having DRS migrate protected VMs back to that host. This will cause those VMs to become disconnected, because the vShield Agent will not automatically start. If a host fails, hope that it fails hard enough that it won’t restart.

Gotcha #3 – Maintenance Mode

At the beginning of this post, I mentioned how VMware Update Manager has spoiled me. VUM can be scheduled to patch VMs and hosts. When host patching is scheduled, VUM places one host at a time in Maintenance Mode, which evacuates all of its VMs. It then applies whatever patches are scheduled, reboots, and exits Maintenance Mode, repeating this for each host in the cluster. This works great unless there are running VMs that have DRS disabled, like the vShield Agent.

In the test environment, when a host was manually set to enter Maintenance Mode, it would stall at 2% without moving the test VMs. I am not sure of the order in which VMs are migrated off, but none were migrated in the test environment, and this could vary in different installations. Here’s the gotcha: you cannot power the vShield Agent off first, because the protected VMs would become disconnected. You cannot migrate it to a different host, because that would cause a serious conflict and, again, disconnect the protected VMs. The only thing you can do is place the host in Maintenance Mode, then MANUALLY (*GASP*) migrate all of the protected VMs, and then power the vShield Agent off. So much for automated patch management. We’re back to the “oughts.”
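Until this is fixed, the manual dance can at least be scripted. Here is a rough PowerCLI sketch of one patch window; the host and VM names (esx01, esx02, vshield-esx01) are placeholders, and it assumes the appliance responds to a guest shutdown:

$vmhost = Get-VMHost "esx01"

# 1. Manually vMotion every running VM except the vShield Agent elsewhere
Get-VM -Location $vmhost |
    Where-Object { $_.Name -ne "vshield-esx01" -and $_.PowerState -eq "PoweredOn" } |
    Move-VM -Destination (Get-VMHost "esx02")

# 2. Shut the vShield Agent down only after its protected VMs are gone
#    (use Stop-VM instead if the appliance ignores a guest shutdown)
Get-VM -Name "vshield-esx01" -Location $vmhost | Shutdown-VMGuest -Confirm:$false

# 3. Maintenance Mode can now complete; patch and reboot as usual
Set-VMHost -VMHost $vmhost -State Maintenance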

Conclusion

I said already that vShield Zones is a 1.x product. It’s a great firewall, but it has a few gotchas that you need to consider, and the benefits may outweigh the negatives. But vSphere is a 4.0 product. Some of this should be addressable by tweaking vCenter or host settings.

vShield Zones should be smart enough to allow us to select specific port groups to protect rather than an entire vSwitch. I guess whatever scripting is being done in the background will need to be changed for this. Maybe we need a Ghetto vShield?

One of the REALLY smart people at VMware should be able to tell us the “order of migration” when a host is placed in Maintenance Mode. Once that is determined, there is probably a configuration file somewhere that we could tweak to change it.

There should be a way to set up automatic startup and shutdown of individual VMs. The Startup/Shutdown settings were sort of deprecated once DRS was introduced; the only time they are useful is with a stand-alone server or in a non-DRS cluster. I guess the only thing that could be done is to add a script somewhere in rc.d or rc.local to start up these VMs (see the sketch below), but how can that be done in a “supported” fashion with ESXi, and is it supported on either ESX or ESXi?
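For what it’s worth, here is the kind of unsupported hack I mean, for classic ESX with a service console. The Agent’s VM name (vshield-esx01) is a placeholder; ESXi would need vim-cmd from tech support mode, which is even further from “supported”:

# Appended to /etc/rc.local on the ESX host: power on the local vShield
# Agent at boot. "getallvms" lists "Vmid Name ..."; grab the Vmid by name.
VMID=$(vmware-vim-cmd vmsvc/getallvms | awk '/vshield-esx01/ {print $1}')
[ -n "$VMID" ] && vmware-vim-cmd vmsvc/power.on "$VMID"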

I brought these issues up with some VMware engineers and they assure me that they are working on this. Hopefully they will figure it out soon. I hate doing things manually; it seems downright anti-cloud.

A Different Take on CEE and FCoE

Last month, I attended a Brocade Net.Ed session that covered Converged Enhanced Ethernet (CEE), Fibre Channel over Ethernet (FCoE) and the idea of server I/O consolidation. If you missed the Net.Ed sessions, you can learn about it at Brocade’s Training Portal. Once you register / login, click on Self-Paced Training and search or browse for FCoE 101 Introduction to Fibre Channel over Ethernet (FCoE). It’s free. Here is an unabridged report on the Net.Ed session with some of my opinions wrapped in:

Trends

With cloud computing, the consolidation of servers, storage and I/O is becoming popular. Once upon a time, server consolidation ratios were bound by processor and RAM capacity. With the introduction of servers with higher core counts, faster processors and more RAM, the new boundary is becoming I/O, and the I/O stack is answering the call for faster speeds. If you look at the trends, Fibre Channel speed has gone from 1Gb to 2Gb to 4Gb and now 8Gb. Soon, 16Gb FC will be the norm. Ethernet has gone from 10Mb to 100Mb to 1Gb and now 10Gb. The next chapter will bring 40Gb or 100Gb or both.

Fibre Channel and Ethernet have been in a leapfrog contest since Fibre Channel was introduced, and there are plenty of arguments about which is “better” and why. Remember how iSCSI was going to take over the world of storage I/O? Why? Because people think they can implement it on the cheap. If it is implemented properly, it may not be that much cheaper than FC. I see too many instances where admins implement iSCSI over their existing network without a thought for available bandwidth, security, I/O, etc. Then they complain about how iSCSI sucks because of poor performance. Consolidation magnifies this. To top it off, iSCSI doesn’t help when dealing with things like FICON or the many tape drives that need faster throughput than iSCSI can offer.

Hardware consolidation is also popular, and sometimes occurs during the server consolidation project. Blade servers are becoming more popular for many reasons: less rack space, fewer cables, centralized management, etc. I just LOVE walking into a data center and looking at the spaghetti mess behind the racks! Even with blade servers, the number of cables is still crazy. Some people still have Top of Rack switches, even with blades. More enlightened people have End of Row or Middle of Row switches. But there is still that mess in the back of the rack. I especially love it when some genius decides to weave cables through the handles on a power supply…

Consolidate Your I/O

Enter I/O consolidation. Brocade calls it Unified I/O. This is supposed to reduce cabling even more. I say “maybe.” In order to consolidate I/O, different protocols, adapters and switches are necessary. OH MY GAWD! New technology! This means the dreaded “C” word… Change. In a nutshell, it reduces the connections: you go from two to four NICs and two to four FC HBAs down to two Converged Network Adapters (CNAs). It is supposed to reduce cabling and complexity. It’s supposed to help with OpEx and CapEx by enabling more airflow/cooling and saving money on admin costs and cable costs, blah blah blah… Didn’t we hear this about blades too?

The Protocols (Alphabet Soup)

In order to make all of this work and become accepted, you need to worry about things like low latency, flow control and lossless quality. This needs to be addressed with standards, and the results are CEE and FCoE. The issue arises with CEE: not all of the components have been finalized. Things like Priority-based Flow Control (IEEE 802.1Qbb), Enhanced Transmission Selection (IEEE 802.1Qaz), Congestion Management (IEEE 802.1Qau) and the Data Center Bridging Exchange protocol (DCBX) are still being worked out. The IETF is also still working on Transparent Interconnection of Lots of Links (TRILL), which will enable Layer 2 multipathing without Spanning Tree.

| Feature/Standard | Benefit |
| --- | --- |
| Priority Flow Control (PFC), IEEE 802.1Qbb | Helps enable a lossless network, allowing storage and networking traffic types to share a common network link |
| Enhanced Transmission Selection (Bandwidth Management), IEEE 802.1Qaz | Enables bandwidth management by assigning bandwidth segments to different traffic flows |
| Congestion Management, IEEE 802.1Qau | Provides end-to-end congestion management for Layer 2 networks |
| Data Center Bridging Exchange Protocol (DCBX) | Provides the management protocol for CEE |
| L2 Multipathing: TRILL in IETF | Recovers bandwidth, multiple active paths; no spanning tree |
| FCoE/FC awareness | Preserves SAN management practices |

Source: Brocade Data Center Convergence Overview Net.Ed Session

My Two Cents

So, without fully functioning CEE, FCoE cannot traverse the network. This stuff is all supposed to be ratified soon, but until these components are ratified, the dream of true FCoE is just a dream. The bridging can’t be done close to the core yet, so people who decide to start using CNAs and Data Center Bridges will need to place the DCBs close to the server (no hops!) and terminate their FC at the DCB. In the case of the Cisco UCS, this is the Top of Rack or End/Middle of Row switch. In the case of an HP chassis, it’s the chassis itself, and they don’t even have this stuff yet.

My question is this: Why adopt a technology that is not completely ratified? Like I said before, all of this requires change. You may be in the middle of a consolidation project and you are looking at I/O consolidation. Do you really want to design your data center infrastructure to support part of a protocol? Are you willing to make changes now and then make new changes in six months to bring the storage closer to the core?

So, let’s assume everything is ratified and you have decided to consolidate your I/O. How many connections do you really save? Based on typical blade chassis configurations, it may be four to eight FC cables. But look at it another way: you are losing that bandwidth. A pair of 10Gb CNAs gives you a total of about 20Gb of bandwidth. A pair of 10GbE adapters plus a pair of 8Gb FC adapters gives you about 36Gb. So, sure, you save a few cables, but you give away bandwidth. When you think about available bandwidth, is a pair of 10Gb CNAs or NICs enough? I remember when 100Mb was plenty. If consolidation is becoming I/O bound, do you want to limit yourself? And how about politics? Will your network team and storage team play nice together? Where is the demarcation between SAN and LAN?

I first saw the UCS Blades almost a year ago and I was excited about the new technology. Their time is coming soon. The HP Blades have always impressed me since they were introduced. They will never go away. I have used the IBM and Dell blades. My mother always said that if I didn’t have anything nice to say about something, don’t say anything at all…

When I take a look at the server hardware available to me now (HP and Cisco), I see pluses and minuses to both. The UCS blades have no provisions for FC, so you need to drink the FCoE Kool-Aid or use iSCSI. The HP blades allow for more I/O connections and can support FC, but not FCoE. If you want to make the playing field similar, you should compare UCS to the HP blades with Flex-10, which makes the back-end I/O modules similar. Both act as a sort of matrix to map internal I/O to external I/O, both will pass VLAN tags for VST, and both will accommodate the Nexus 1000V dvSwitch. The thing about Flex-10 is that it requires a different management interface if you are already a Cisco shop.

There’s a fast moving freight train called CHANGE on the track. It never stops. You need to decide when you have the guts to jump on and when you have the guts to jump off.