qcow2 File Analysis

A qcow2 virtual disk contains an image header, a two-level lookup table, a reference table, and data clusters, as shown in Figure 1. The image header resembles the superblock of a file system: it contains basic information about the image file, such as the base addresses of the lookup table and the reference table. The image file is organized at the granularity of clusters, and the cluster size is stored in the image header. The lookup table is used for address translation. A virtual block address (VBA) a in the guest VM is split into three parts, i.e., a = (a1, a2, a3): a1 is used as the index into the L1 table to locate the corresponding L2 table; a2 is used as the index into the L2 table to locate the corresponding data cluster; a3 is the offset within the data cluster. The reference table tracks each cluster used by snapshots. The refcount in the reference table is set to 1 for a newly allocated cluster, and its value grows when more snapshots use the cluster.
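
To make the address split concrete, the following C sketch decomposes a VBA into (a1, a2, a3). It assumes 64KB clusters and 8-byte table entries, so an L2 table occupying one cluster has cluster_size/8 entries; both values are illustrative rather than read from a real image header.

#include <stdint.h>
#include <stdio.h>

/* Sketch: split a virtual block address into (a1, a2, a3).
 * Assumes 8-byte table entries, so an L2 table stored in one
 * cluster holds cluster_size / 8 entries (illustrative values). */
int main(void)
{
    unsigned cluster_bits = 16;                   /* 64 KB clusters */
    unsigned l2_bits = cluster_bits - 3;          /* log2(cluster_size / 8) */
    uint64_t vba = 0x12345678ULL;                 /* guest virtual block address */

    uint64_t a3 = vba & ((1ULL << cluster_bits) - 1);              /* offset in data cluster */
    uint64_t a2 = (vba >> cluster_bits) & ((1ULL << l2_bits) - 1); /* L2 table index */
    uint64_t a1 = vba >> (cluster_bits + l2_bits);                 /* L1 table index */

    printf("a1=%llu a2=%llu a3=%llu\n",
           (unsigned long long)a1, (unsigned long long)a2, (unsigned long long)a3);
    return 0;
}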


The process of writing new data to a virtual disk includes the following steps: 1) look up the L1 table to get the offset of the L2 table; 2) if the L2 table is not allocated, set the corresponding reference table entry to allocate a cluster for the L2 table, and initialize the new L2 table; 3) if a new L2 table was allocated, update the L1 table entry to point to it; 4) set the reference table to allocate a cluster for the data; 5) write the data to the new data cluster; 6) update the L2 table entry to point to the new data cluster.
Note that the steps of this appending process must not be reordered; otherwise, metadata inconsistency may result.
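
As a rough illustration of the six steps above, the following toy model performs them on in-memory tables. It is a sketch of the logical flow only, not QEMU's actual data structures, and all sizes and names (alloc_cluster, append_write) are made up for the example.

#include <stdint.h>
#include <string.h>

/* Toy in-memory model of the six append steps; not QEMU code.
 * Sizes are illustrative (4 KB clusters, 64 clusters total). */
#define CLUSTER_SIZE 4096
#define L1_ENTRIES   64
#define L2_ENTRIES   (CLUSTER_SIZE / 8)
#define MAX_CLUSTERS 64

static uint64_t l1_table[L1_ENTRIES];                 /* 0 means "not allocated" */
static uint64_t l2_tables[MAX_CLUSTERS][L2_ENTRIES];
static uint16_t refcount[MAX_CLUSTERS];               /* the reference table */
static uint8_t  clusters[MAX_CLUSTERS][CLUSTER_SIZE];
static uint64_t next_free = 1;                        /* cluster 0 reserved */

static uint64_t alloc_cluster(void)
{
    uint64_t c = next_free++;
    refcount[c] = 1;      /* steps 2 and 4: set the reference table entry first */
    return c;
}

static void append_write(uint64_t a1, uint64_t a2,
                         const uint8_t *buf, size_t len)
{
    uint64_t l2 = l1_table[a1];            /* step 1: consult the L1 table */

    if (l2 == 0) {
        l2 = alloc_cluster();              /* step 2: allocate and ...            */
        memset(l2_tables[l2], 0,
               sizeof(l2_tables[l2]));     /*         ... initialize the L2 table */
        l1_table[a1] = l2;                 /* step 3: publish it in the L1 table  */
    }

    uint64_t data = alloc_cluster();       /* step 4: allocate a data cluster */
    memcpy(clusters[data], buf, len);      /* step 5: write the data          */
    l2_tables[l2][a2] = data;              /* step 6: publish it in the L2 table */
}

int main(void)
{
    const uint8_t payload[] = "hello qcow2";
    append_write(0, 0, payload, sizeof(payload));
    return 0;
}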

The organization of the qcow2 format requires extra effort to retain crash consistency, such that the dependencies between the metadata and the data are respected. For example, a data cluster should be flushed to disk before the lookup table is updated; otherwise, the lookup table entry may point to garbage data. The reference table should be updated before the lookup table; otherwise, the lookup table may point to an unallocated data cluster.
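
A minimal way to express these two ordering constraints with ordinary file I/O is sketched below; the offsets and sizes are placeholders rather than the real qcow2 layout, and the file name toy.img is hypothetical.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Sketch of the required ordering for an allocating write, using plain
 * pwrite/fdatasync on an ordinary file; offsets are placeholders. */
int main(void)
{
    int fd = open("toy.img", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    uint8_t  data[4096];
    memset(data, 0xcd, sizeof(data));
    uint16_t ref      = 1;             /* refcount of the newly allocated cluster */
    uint64_t l2_entry = 3 * 4096;      /* offset where the data cluster lives */

    /* The data cluster and the reference table must be durable before the
     * lookup table is allowed to point at them. */
    if (pwrite(fd, data, sizeof(data), 3 * 4096) < 0 ||
        pwrite(fd, &ref, sizeof(ref), 2 * 4096) < 0 ||
        fdatasync(fd) < 0)
        return 1;

    /* Only now is it safe to publish the new L2 entry. */
    if (pwrite(fd, &l2_entry, sizeof(l2_entry), 1 * 4096) < 0 ||
        fdatasync(fd) < 0)
        return 1;

    close(fd);
    return 0;
}
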
We use two simple benchmarks in QEMU-2.1.2 to compare the number of sync operations in the guest VM and the host: 1) an “overwrite benchmark”, which allocates blocks in advance in the disk image (i.e., the qcow2 image size remains the same before and after the test); 2) an “append benchmark”, which allocates new blocks in the disk image during the test (i.e., the image size increases after the test). The test writes 64KB of data and calls fdatasync every 50 iterations. We find that the virtual disks introduce more than 3X the sync operations for qcow2 images and 4X for VMDK images, as shown in Figure 2.
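
For reference, the append benchmark can be approximated by a loop like the one below, assuming 64KB is written per iteration; the file name, open flags, and total iteration count are illustrative assumptions, not the exact benchmark code.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Approximation of the "append benchmark": write 64 KB per iteration and
 * call fdatasync every 50 iterations.  File name and iteration count are
 * illustrative assumptions. */
int main(void)
{
    int fd = open("bench.img", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    static char buf[64 * 1024];
    memset(buf, 0xab, sizeof(buf));

    for (int i = 0; i < 1000; i++) {
        if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
            return 1;
        if ((i + 1) % 50 == 0)
            fdatasync(fd);             /* the sync operation being counted */
    }

    close(fd);
    return 0;
}
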
As shown in Figure 3, an fdatasync issued by the user application can cause a transaction commit in the file system. This requires two flushes (in the guest VM) to preserve its atomicity, which are then translated into two sets of writes in QEMU. The first write puts the VM's data and journal metadata onto the virtual disk, which, in the worst case, causes the image size to grow.

To grow the virtual disk in QEMU, a data block must be allocated, and the corresponding reference table block must be set strictly before other operations; this necessitates the first flush. After that, the L2 table block must be updated strictly before the remaining operations; this necessitates the second flush. (In some extreme cases where the L1 table block must be updated as well, even more flushes are introduced.) The third flush is used to update the base image's reference table. When creating a new image on top of a base image, the refcount in the reference table of the base image is increased by one to indicate that another image uses the base image's data. When updating the new image, qcow2 copies data from the base image to a new place and updates the copy. The new image will use the COW data and will no longer access the old data in the base image, so the refcount in the base image should be decreased by one. The third flush makes the reference table of the base image durable. The fourth flush is introduced solely because of a suboptimal implementation in QEMU. The second write behaves the same as the first one and also needs four flushes. Consequently, up to around eight flushes are needed for one guest fdatasync.
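
Under these assumptions, the worst-case accounting works out as sketched below; this is a back-of-the-envelope calculation following the breakdown above, not a measurement.

#include <stdio.h>

/* Back-of-the-envelope flush accounting per guest fdatasync (worst case,
 * where every write allocates a new cluster). */
int main(void)
{
    int flushes_per_write =
        1     /* reference table entry for the newly allocated cluster */
      + 1     /* L2 table update */
      + 1     /* base image's reference table */
      + 1;    /* extra flush from the suboptimal QEMU implementation */

    int writes_per_guest_fdatasync = 2;   /* two sets of writes, one per guest flush */

    printf("host flushes per guest fdatasync: %d\n",
           flushes_per_write * writes_per_guest_fdatasync);   /* prints 8 */
    return 0;
}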