Hunting for 20GB of missing disk space where the usual culprits turned out to be innocent...
The server in question is an Azure VM running CentOS 7 that does batch processing of large log files. Logs are streamed to the box via `rsyslogd`, processed, then archived as hourly compressed files courtesy of logrotate. All very straightforward stuff.
I’d seen that the available disk space had started to go down dramatically. Usually the machine runs with well over 50% disk free and has a very predictable disk usage pattern, but now it was eating through the available buffer space at an alarming rate.
`df` showed me a couple of GBs free when we'd typically have far more:
```
$ df -h
Filesystem      Size   Used  Avail Use% Mounted on
devtmpfs        1.7G      0   1.7G   0% /dev
tmpfs           1.7G      0   1.7G   0% /dev/shm
tmpfs           1.7G   968K   1.7G   1% /run
tmpfs           1.7G      0   1.7G   0% /sys/fs/cgroup
/dev/sda2        30G  27.8G   2.1G  94% /              <-- unexpected!
/dev/sda1       497M    64M   433M  13% /boot
/dev/sdb1        50G     4G    44G   8% /mnt/resource
tmpfs           345M      0   345M   0% /run/user/1002
```
Next I ran `du` over the root directory and sorted the result to find the largest entries, e.g. `du -chs /* | sort -h`. I had perhaps expected to find some large set of log files copied out to a user directory or similar. That didn't happen; instead `du` gave me a total disk used of 8GB. A little discrepancy is expected, but a 20GB gap is not.
It is not completely unexpected to find that the results of `du` and `df` don't match. You might think `df`/disk-free should complement `du`/disk-used, and they sort of do, but they arrive at their values in different ways.
- `df` shows the blocks allocated in the entire file system, including inodes and other metadata.
- `du` shows the blocks actually allocated to an individual file.
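As a quick illustration that "disk used" depends on how you count, a sparse file makes the difference visible without needing root (a minimal sketch; `sparse.img` is just an illustrative name):

```shell
# Create a 100MB sparse file: large apparent size, almost no blocks allocated.
truncate -s 100M sparse.img

# Blocks actually allocated on disk (what du normally counts): ~0
du -k sparse.img

# Apparent file size (what ls -l reports): 102400 KB
du -k --apparent-size sparse.img

rm sparse.img
```

The two numbers disagree wildly for the same file, which is a small-scale version of the same "whose accounting do you trust?" question.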
One typical reason for the discrepancy is one or more processes hanging on to file handles that are attached to deleted files. You can easily confirm this with `lsof`. The fix is often a process restart (or sending a suitable signal) to whichever process has the handle…
```
$ lsof | grep deleted
$
```
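You can reproduce the deleted-but-still-open situation safely without root (a sketch; `demo.log` is an arbitrary name, and the `/proc` check makes this Linux-specific):

```shell
# Open a file on descriptor 3, write to it, then unlink it.
exec 3> demo.log
echo "some data" >&3
rm demo.log

# The blocks stay allocated while the fd is open; the kernel marks
# the entry "(deleted)", which is what lsof reports too.
ls -l /proc/$$/fd | grep deleted

# Closing the descriptor finally releases the space.
exec 3>&-
```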
However that is not our issue this time.
Where else can disk space go?
Clutching at straws a little, I wondered if the disk was very fragmented, although losing multiple GBs of space to fragmentation seemed a big stretch. I also don't think I have ever had to defrag an XFS file system. But a bad plan is better than no plan, and indeed that proved true here, as it pointed me in the right direction.
`xfs_fsr` is the tool required. After a couple of minutes of running, it threw an error about a file too large to move. The error listed the large file's inode number and size. The size was around 20GB. Suspicious…
`xfs_fsr` works by copying files to new (disk) locations, so this error makes sense when our total disk is only 30GB. It also tells us that we really do have a very large file somewhere.
Trying to go from the inode to the file path proved unsuccessful:
```
$ find / -xdev -inum 17313616
$
```
Accidentally Hiding Data
To recap so far:
- We don’t have any deleted files still being used.
- We do appear to have an inaccessible large file with an inode we can’t convert to a file path.
Searching online for some pointers, I came across a forum post about disk mounts that can obscure other file locations. That's an interesting idea.
This server does have two disks in use: one standard disk mounted at `/`, and one temp disk (runtime only, data lost on reboot) mounted at `/mnt/resource`.
I was able to prove this was the issue by remounting the primary disk at a second location with `mount`'s bind option. I tried this with `mount --bind / /tmp/tmpmount`, so any path `/foo` was also available as `/tmp/tmpmount/foo`. Crucially, a plain bind mount of `/` exposes the root file system without the other mounts layered on top of it.
`du -chs /tmp/tmpmount` then gave me 28GB, not 8GB, and sure enough it is because `/tmp/tmpmount/mnt/resource` existed with a big 20GB temporary log file inside.
I could fix it by deleting the file from `/tmp/tmpmount/mnt/resource`:
```
$ df -h
Filesystem      Size   Used  Avail Use% Mounted on
devtmpfs        1.7G      0   1.7G   0% /dev
tmpfs           1.7G      0   1.7G   0% /dev/shm
tmpfs           1.7G   968K   1.7G   1% /run
tmpfs           1.7G      0   1.7G   0% /sys/fs/cgroup
/dev/sda2        30G   4.7G    25G  16% /              <-- more like it!
/dev/sda1       497M    64M   433M  13% /boot
/dev/sdb1        50G    17G    30G  36% /mnt/resource
tmpfs           345M      0   345M   0% /run/user/1002
```
How did this happen?
Here is the theory:
- The virtual machine starts up/reboots.
- `rsyslogd` started listening for data and streaming it out to a log file under `/mnt/resource`.
- It did this before the temp disk had been mounted, so the file was actually created on the root disk.
- Once the mount took place the original file was hidden.
- `du` works by walking all visible files and summing their sizes, so it won't see a file hidden beneath a mount point.
- The file then continued to grow via the still-open handle, while each hour logrotate ran and could not see any files to operate on.
- A restart of `rsyslogd` would have half resolved the issue: files would then start appearing in the right place, but the existing large file would be left hidden and stuck there forever.
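The hiding step is easy to reproduce in a scratch VM as root (`/mnt/demo` is an arbitrary path; any mount, tmpfs here, will do):

```
$ mkdir -p /mnt/demo
$ echo "written before the mount" > /mnt/demo/app.log
$ mount -t tmpfs tmpfs /mnt/demo   # mount anything over the directory
$ ls /mnt/demo                     # app.log has vanished from view
$ umount /mnt/demo
$ ls /mnt/demo                     # app.log is back, untouched
```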
Fixing The Bad Mount
systemd has support for saying ‘only start this service after X condition is
met’. Often we think of that as another service, but you can include mount
points as well (run
systemctl list-units --type=mount to see what’s
available). If this is news to you then check out Digital Ocean’s
systemd page which is a helpful condensed guide. We also don’t need to
override the system installed
rsyslog unit file, which might get lost on
upgrade. We can instead place a drop-in file in a new sub-directory. Our fix below assumes the core unit file is at `/usr/lib/systemd/system/rsyslog.service`.
```
$ cat /usr/lib/systemd/system/rsyslog.service.d/wait-for-mount.conf
# Ensure our temp disk for log storage is ready to go before starting
[Unit]
After=mnt-resource.mount
```
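After placing the drop-in, tell systemd to pick it up and confirm the dependency registered:

```
$ systemctl daemon-reload
$ systemctl restart rsyslog
$ systemctl show rsyslog -p After   # mnt-resource.mount should now appear in the list
```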
Mystery solved and fixed for the future!
Postscript - Is this really fixed?
Coming back to this draft ahead of hitting publish, it turns out the fix might not be as straightforward in all cases. Let me explain a little…
The Azure temporary disk mount appears to be a little different to normal disk mounts. `waagent.service` (the Azure guest agent) is responsible for performing the mount (rather than an entry in `/etc/fstab`, which would simply fail in this case). However, setting `After=waagent.service` in your unit config doesn't resolve the issue, as systemd considers the `waagent` service being "running" to be the same as being "ready".
I had wondered about using systemd auto-restart on `rsyslogd` together with a condition line based on the mount directory, but if the condition isn't met systemd won't then try to load the unit again (it's different to having a process fail and get auto-restarted).
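For reference, the kind of condition line I mean is systemd's `ConditionPathIsMountPoint=`. When a condition fails, the unit is silently skipped rather than retried, which is exactly the behaviour described above:

```
# Hypothetical drop-in: only start if the path is already a mount point.
[Unit]
ConditionPathIsMountPoint=/mnt/resource
```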
Some online suggestions are along the lines of having a script that checks for the mount on a timer and then kick-starts the required service. I decided to go for a simpler hack of a solution and just added a 30 second startup delay (`ExecStartPre=/bin/sleep 30`) to the rsyslog extra unit file. It does the job, and considering the wider processing happening here is temporary anyway, it doesn't seem worth creating anything more elaborate.
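For completeness, here is a sketch of what the combined drop-in could look like, using `ExecStartPre=/bin/sleep` as the delay mechanism (the 30 seconds is an arbitrary grace period):

```
$ cat /usr/lib/systemd/system/rsyslog.service.d/wait-for-mount.conf
# Ensure the temp disk for log storage is ready before starting,
# and give the Azure agent a grace period on boot
[Unit]
After=mnt-resource.mount

[Service]
ExecStartPre=/bin/sleep 30
```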