When you get the stats for the size of a folder, where, exactly, do those measurements come from?
When you right-click to view the properties of a folder, the property sheet includes two values: Size and Size on disk. What exactly do these values mean? What are they measuring?
The property sheet performs a naïve recursive directory search for all files. It doesn’t try to filter out file names referring to the same underlying file by means of a hard link. If you don’t have access to a subdirectory, the recursive directory search will skip that subdirectory, and those files won’t be counted in the total folder size.
As it turns out, the recursive directory search has some smarts. Part of it is being smart on purpose: It detects reparse points and doesn’t recurse into them. Another part is being smart by accident: Symbolic links to files count as zero size. This isn’t because the directory search code is clever about the files. It’s because the directory entry for symbolic links reports them as having zero size. Now you know what files are counted, but where do those numbers come from?
The Size measurement is easy: It’s a running tally of the file sizes as reported by the FindFirstFile function in the WIN32_FIND_DATA.nFileSizeLow and nFileSizeHigh. Mind you, those values aren’t necessarily accurate either because of the way the NTFS file system updates directory entries. That’s a topic for another day, but the short version is that files still being written to may not report an accurate file size until the file handle is closed. Even then, it will only update the directory entry used to open the file.
The Size on disk measurement is more complicated. If the drive supports compression (as reported by the FILE_FILE_COMPRESSION flag returned by the GetVolumeInformation function) and the file is compressed or sparse (FILE_ATTRIBUTE_COMPRESSED, FILE_ATTRIBUTE_SPARSE_FILE), then the Size on disk for a file is the value reported by the GetCompressedFileSize function. This reports the compressed size of the file (if compressed) or the size of the file minus the parts that were de-committed and logically treated as zero (if sparse). If the file is neither compressed nor sparse, then the Size on disk is the file size reported by the FindFirstFile function rounded up to the nearest cluster.
The Windows 95 team originally developed the Size on disk algorithm. Their view of the file system world was biased by their MS-DOS background. There the only disk file system was FAT. There was no such thing as a hard link or alternate data stream. File contents were stored in units of clusters.
Those assumptions don't hold true for NTFS—not even the “file contents are stored in units of clusters” part. In NTFS, a file can actually consume zero clusters for its data by stashing itself into slack space in the master file table (MFT). (For more details on this, see “The Four Stages of NTFS File Growth.”)
Naturally, the Size on disk algorithm doesn’t take into account other file system overhead, like the disk space occupied by the file name itself, directory entry information, file metadata and alternate data streams.
The values reported by Size and Size on disk aren’t meant to be a byte-for-byte accounting of the total impact of a directory on your disk free space. They’re just a rough estimate based on the assumption that most files are of the boring variety. By that, I mean no hard links and negligible use of alternate data streams. If you have a directory with numerous hard links—such as the Windows directory itself, for example—the values will be way off.
You can use Size on disk as a sniff-test to get a rough idea of the size of a directory, but remember that it’s a naïve calculation. If you need to keep careful tabs on disk consumption, you’d be better off using a feature like Disk Quotas, whose purpose is to more intelligently track disk consumption.
Raymond Chen's Web site, The Old New Thing, and identically titled book (Addison-Wesley, 2007) deal with Windows history, Win32 programming and telling jokes to three-year-olds.