VMware recognized pretty early that backing up virtual machines (VMs) through the traditional backup approaches (client in VM) was not going to work very well. The problem is simple: Traditional backups are extremely resource-intensive in terms of both CPU and memory. I've always said that nothing stresses a server more than backing it up. As a matter of fact, nothing stresses your network more than a backup application – but that's a topic for a different time.
The traditional client backup method worked fine in the physical world because we had excess CPU and memory to spare. Of course, that excess capacity is exactly why we can virtualize those machines in the first place: the physical server hosting the VMs can do its magic because no single VM is consuming all of the resources. Backups, on the other hand, will consume everything you give them, and that's a problem.
In the ESX 3.x days, VMware released a backup API solution called VMware® Consolidated Backup (VCB). As a backup solution, it was a good first attempt, but in practical terms it did not scale, it was difficult to manage, and reliability suffered as a result. Much to their credit, VMware learned their lessons and released a completely new set of APIs in vSphere 4.0 (and later). These are collectively known as the vStorage APIs, and they come in two flavors: VAAI and VADP. VAAI (vStorage API for Array Integration) is a set of APIs that allows storage vendors to integrate much more tightly with the hypervisor. VADP (vStorage API for Data Protection) is a set of tools that lets backup vendors efficiently protect resources in a vSphere environment. Both APIs are very rich topics of discussion.
Exploring VMware vStorage APIs even further
As I mentioned, the vStorage APIs come in two flavors: VADP and VAAI. They are both interesting, but for this discussion, let's focus on the basics of VAAI. This specific set of APIs is designed so that the VMware kernel and storage arrays can integrate better with each other. The gist is that a lot of the tasks that were performed by the ESX kernel can now be offloaded to the storage arrays. This is good for several reasons:
- The VMware kernel/physical server is freed from doing tedious but resource-intensive tasks.
- The arrays have better "visibility" into what's being stored on them and can better optimize their functions.
- Time-consuming tasks can be minimized and optimized.
- Scalability is improved.
How's that done? Well, in vSphere 4.1, there are three "primitives" (it was supposed to be four, but we'll discuss that later). These primitives are specific functions that array vendors can choose to implement. Each of them delegates specific functionality from the ESX host down to the array. They are:
- __Full Copy__: Offloads the copying of data from the ESX host down to the array.
- __Block Zeroing__: The array does the work of zeroing out large chunks of space on disk.
- __Hardware-assisted Locking__: Extends the ways that VMFS protects critical information.
Let's look at Full Copy and Block Zeroing in this post. I'll follow up with another post on Hardware-assisted Locking.
Full Copy
One of the most common tasks in VMware is the creation of virtual machines (VMs). Typically, this is done from a template. Without VAAI, the ESX host tasked with creating the copy does all of the I/O. It literally has to read each block from the template VM and copy it over to the destination VM. This is very time-consuming and can be quite a burden on the ESX host, not to mention the HBA, SAN, etc. With __Full Copy__ enabled, the ESX host is still involved in the copy operation, but mostly as a controller. The vast majority of the work is done by the array itself. This not only frees the host from having to do the I/O, it can also dramatically improve the performance of the operation. We have seen performance increase with some arrays by as much as 10 times. As time goes by and the arrays get smarter about this, there's no reason not to expect even higher numbers.
Another place you end up moving lots of bits is when you do Storage vMotion. Arrays that implement __Full Copy__ can offload this function from the ESX host as well. As with VM provisioning, the ESX host is still involved in the process, but only as a controlling mechanism; the bulk of the I/O is handled by the array. This also delivers a significant increase in performance, though in our experience not as dramatic as during provisioning. I fully expect this to get better over time as well.
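If you want a mental model for the difference, here is a deliberately simplified Python sketch. Everything in it is invented for illustration (toy classes, not the real vmkernel code path or the SCSI offload interface), but it captures the shape of the change: without Full Copy, every block makes a round trip through the host; with it, the host issues a single copy request and the array moves the data internally.

```python
# Purely illustrative toy model -- the real mechanism is a SCSI offload
# handled inside the array, not Python objects on the host.

BLOCK = b"\x00" * 512  # stand-in for a block of data


class Lun:
    """Toy LUN: a list of blocks that 'lives' inside the array."""

    def __init__(self, num_blocks):
        self.blocks = [BLOCK] * num_blocks

    def read(self, i):
        return self.blocks[i]      # data crosses the SAN to the ESX host...

    def write(self, i, data):
        self.blocks[i] = data      # ...and crosses it again on the way back


class Array:
    """Toy array that can copy between its own LUNs with no host I/O."""

    def copy_extent(self, src, dst, count):
        dst.blocks[:count] = src.blocks[:count]   # internal data movement


def host_mediated_clone(src, dst, count):
    # Without Full Copy: the ESX host reads and rewrites every single block.
    for i in range(count):
        dst.write(i, src.read(i))


def offloaded_clone(array, src, dst, count):
    # With Full Copy: the host sends one request; the array does the work.
    array.copy_extent(src, dst, count)


template, clone = Lun(1000), Lun(1000)
offloaded_clone(Array(), template, clone, 1000)   # one call, no blocks over the wire
```

The point of the sketch is that loop: the per-block round trip is exactly the work the host, HBA, and SAN no longer have to do once the array accepts the offloaded request.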
Block Zeroing
When you create a VM on block storage, you have three options for how to allocate the space: thin, zeroedthick, and eagerzeroedthick. The first one basically tells VMware not to pre-allocate any space on disk for the VM. The VM essentially thinks it has a disk of size x, but only consumes space on disk when the guest OS does a write. Very cool, but it has several performance implications. We'll discuss those some other time.
With the zeroedthick format, VMware pre-allocates all of the space for a VMDK disk image. However, it defers the zeroing (blanking) of that space until the first time the guest OS writes to each specific block. With this, you create your VMDKs really fast, but you pay a performance penalty on the first write to each block, because the block has to be zeroed before your actual data is written.
In contrast, the eagerzeroedthick format also pre-allocates all of the space for a VMDK disk image and, at the same time, zeroes out all of the blocks. As you might imagine, this takes a long time, and the ESX host has to write all of those zeros. The good news is that afterwards the VM doesn't pay a penalty when it starts using its disk image. For many VM administrators, that is reason enough to pay the deployment cost. __Block Zeroing__ alleviates this trade-off: you get the runtime performance of eagerzeroedthick with deployment times much closer to zeroedthick. Naturally, the API accomplishes this by offloading the zero-writing down to the array. It still takes time, but it is faster, and the ESX host doesn't have to generate all those I/Os.
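To make the formats concrete, here's a toy Python model of how each one handles allocation and first-touch zeroing. The class and its mechanics are mine, not VMware's; it just shows where the zeroedthick first-write penalty and the eagerzeroedthick deployment cost come from.

```python
# Toy model of the three allocation formats -- simplified for illustration,
# not how VMware actually implements VMDKs.

class ToyVmdk:
    def __init__(self, fmt, size_blocks):
        self.fmt = fmt
        self.zeroed = set()
        if fmt == "eagerzeroedthick":
            # All space is reserved AND zeroed at creation: slow to deploy
            # (unless Block Zeroing offloads it), but no penalty afterwards.
            self.zeroed = set(range(size_blocks))
        # "zeroedthick": space reserved up front, zeroing deferred.
        # "thin": neither space nor zeroing handled up front.

    def guest_write(self, block):
        """Return 1 if this write pays the first-touch zeroing penalty."""
        if block in self.zeroed:
            return 0
        self.zeroed.add(block)   # zero the block first, then write real data
        return 1


lazy = ToyVmdk("zeroedthick", 1000)
eager = ToyVmdk("eagerzeroedthick", 1000)
print(lazy.guest_write(42), eager.guest_write(42))   # -> 1 0
```

With Block Zeroing, filling that up-front set of zeroed blocks stops being the host's job, which is exactly what makes the eager format so much cheaper to deploy.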
One of the neat things that comes from this is that if your array supports thin provisioning at the hardware level and VAAI, it can start doing some nifty tricks. First, if the LUN is thin provisioned, the array doesn't actually have to write zeros to each block. VAAI tells the array to do it, and the array responds immediately that it did, when in reality it simply threw those operations away. You see, the blocks don't need to be zeroed out: they haven't been used yet (they are all zeros after all). The array is smart enough to know that, and it does not allocate the space until you really, really need it.
Also, arrays that support thin provisioning often provide the ability to reclaim "zeroed pages." With thin provisioning, LUNs start off small but grow over time, even if the OS has deleted files. The OS and the array don't communicate, so the array typically doesn't know that a file has been deleted and that its blocks can be reclaimed. There are some neat ways of solving this with physical hosts, but they don't map all that well to VMware VMDK files. That was a big problem, until VAAI and __Block Zeroing__ came along. With this combo, smart arrays can now give you space efficiency and... well, space efficiency. Trust me. It's very nice.
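Here's a tiny sketch of that trick, again with invented names: a thin-provisioned LUN that simply acknowledges zero-writes to pages it has never allocated, instead of storing anything.

```python
# Invented-name sketch of why a thin-provisioned, VAAI-aware array can treat
# Block Zeroing requests as (nearly) free: zeroing a page it never allocated
# requires storing nothing at all.

class ToyThinLun:
    def __init__(self):
        self.allocated = {}              # page -> data, for real writes only

    def write(self, page, data):
        if not any(data) and page not in self.allocated:
            return "ack"                 # all zeros, never allocated: discard
        self.allocated[page] = data      # real data: allocate and store it
        return "ack"


lun = ToyThinLun()
lun.write(7, b"\x00" * 4096)             # a zeroing request costs nothing
print(len(lun.allocated))                # -> 0, no space was consumed
```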
Digging into Hardware-assisted Locking
Like the other primitives, this is a specific function that array vendors can choose to implement.
VMFS is a very clever distributed cluster file system. It is the underpinning of much of what you do with VMware: vMotion, virtual machine (VM) creation, snapshots, and many other whizbang things. To do its magic, VMFS has to coordinate and guarantee the safety of your VMs at all times. Because multiple hosts can access the same file system at the same time, it has to have a way of ensuring that only a single host performs certain operations at a time. Imagine if two hosts tried to start the same VM at the same time. That'd be like crossing the streams: you'd get total protonic reversal. OK, not that bad, but you would get your VM, and possibly your entire ESX farm, into a pickle. To make sure this doesn't happen, VMFS uses SCSI reservations. This control mechanism (a.k.a. node fencing) is very coarse. The granularity it can control is at the LUN level, and because of that, if you have multiple VMs on a single datastore, certain operations can only be done to __one__ VM at a time. The list of operations is somewhat surprising. These are the things you can only do one at a time on each block datastore:
- vMotion migrations
- Creating a new VM or template
- Deploying a VM from a template
- Powering a VM on or off (Really – try starting two VMs on the same _block_ datastore at the same time. No can do.)
- Creating, deleting, or growing a file
- Creating, deleting, or growing a snapshot
And that is one of the big reasons that VMware's best practices have been to limit block datastores to 500 GB or less. You just can't put that many VMs in 500 GB. That means your chances of creating a conflict are smaller. Which is a good thing. BUT... that also means you typically end up with gobs of datastores. And that you artificially limit the size of your clusters. And that you waste space because each datastore ends up with unusable free space.
I've been saying "block datastores" a lot, haven't I? Well, it turns out that a lot of these limitations don't apply to NFS datastores. This is one of the big reasons people love NFS datastores. You can create really big datastores. You can have multiple operations going on at the same time. Arrays can handle lots of the file management magic for you (hardware-based snapshots). And there are other benefits; I'll cover those in yet another blog post (told you we have lots to cover). As good as NFS is, its drawback is that it is a TCP/IP-based protocol. Yup. Now, I don't want to get into the Fibre Channel vs. Ethernet debate. Let's just leave it at this: Fibre Channel+FCP is a much more efficient stack for disk I/O than Ethernet+TCP/IP. If performance is at an absolute premium, it's typically easier to architect an FCP-based solution than a TCP/IP-based one. Notice my word: typically. OK, Ethernet fans?
Now that we have that out of the way, enter __Hardware-assisted locking__. With this API, VMFS can lock at a much more granular level: the individual block. In practical terms, that means that each VMDK file can be locked individually. This API essentially eliminates or reduces the limits on all of the operations I described above. You can have VMFSs that are larger, host more VMs, and cluster more hosts.
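If a picture helps, here is a deliberately simplified Python sketch of the two locking models. The classes and the use of threading locks are stand-ins I made up; real VMFS locking happens on disk via SCSI reservations or atomic test-and-set operations, not in Python. The shape of the difference is the point: the old model serializes every metadata update behind one LUN-wide lock, while the hardware-assisted model only makes hosts wait when they touch the very same block.

```python
import threading

# Simplified stand-in: real VMFS locking uses SCSI reservations or on-disk
# atomic test-and-set, not Python threading primitives.


class LunReservation:
    """Old model: one reservation covers the entire datastore."""

    def __init__(self):
        self._lun_lock = threading.Lock()

    def update_metadata(self, block):
        with self._lun_lock:      # every host on this LUN queues up here
            pass                  # ...modify any piece of VMFS metadata...


class HardwareAssistedLocking:
    """New model: each on-disk lock (block) is acquired individually."""

    def __init__(self):
        self._block_locks = {}

    def update_metadata(self, block):
        lock = self._block_locks.setdefault(block, threading.Lock())
        with lock:                # only contends when two hosts want
            pass                  # ...the exact same block/VMDK
```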
So, what does it mean?
VAAI is a game changer for VMware. So much of what you do on a day-to-day basis with VMware is centered around the storage. With these APIs, much of that is simplified and/or sped up dramatically. If you are deploying vSphere 4.1, you owe it to yourself to make sure that your storage vendor supports VAAI. All of it.
But... because you are going to be able to do more with your storage, that means that your VMware farms, VMs, and storage needs are going to grow. Storage vendors like that. Backup vendors like it even more. And with that I leave you hanging for my next post, where I'll cover the other half of vStorage, VADP.