Hunting for 20GB of missing disk space where the usual culprits turned out to be innocent...
The server in question is an Azure VM running CentOS 7 that does batch processing of large log files. Logs are streamed to the box via `rsyslogd`, processed, then archived as hourly compressed files courtesy of logrotate. All very straightforward stuff.
I’d seen that the available disk space had started to go down dramatically. Usually the machine runs with well over 50% disk free and has a very predictable disk usage pattern, but now it was eating through the available buffer space at an alarming rate.
`df` showed me a couple of GBs free when we'd typically have far more:
```
$ df -h
Filesystem      Size   Used  Avail Use% Mounted on
devtmpfs        1.7G      0   1.7G   0% /dev
tmpfs           1.7G      0   1.7G   0% /dev/shm
tmpfs           1.7G   968K   1.7G   1% /run
tmpfs           1.7G      0   1.7G   0% /sys/fs/cgroup
/dev/sda2        30G  27.8G   2.1G  94% /              <-- unexpected!
/dev/sda1       497M    64M   433M  13% /boot
/dev/sdb1        50G     4G    44G   8% /mnt/resource
tmpfs           345M      0   345M   0% /run/user/1002
```
Next I ran `du` over the root directory and sorted the result to find the largest entries, e.g. `du -chs /* | sort -h`. I had perhaps expected to find some large set of log files copied out to a user directory or similar. That didn't happen; instead `du` gave me a total disk used of 8GB. A little discrepancy is expected, but a 20GB gap is not.
It is not completely unexpected to find that the results of `du` and `df` don't match. You might think `df`/disk-free should complement `du`/disk-used, and they sort of do, but they arrive at their values in different ways.
- `df` shows the blocks allocated in the entire file system, including inodes and other metadata.
- `du` shows the blocks actually allocated to an individual file.
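As a quick illustration that "disk used" depends on how you count, a sparse file makes the difference visible without needing root (a minimal sketch; `sparse.img` is just an illustrative name):

```shell
# Create a 100MB sparse file: large apparent size, almost no blocks allocated.
truncate -s 100M sparse.img

# Blocks actually allocated on disk (what du normally counts): ~0
du -k sparse.img

# Apparent file size (what ls -l reports): 102400 KB
du -k --apparent-size sparse.img

rm sparse.img
```

The two numbers disagree wildly for the same file, which is a small-scale version of the same "whose accounting do you trust?" question.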
One typical reason for the discrepancy is one or more processes hanging on to file handles that are attached to deleted files. You can easily confirm this with `lsof`. The fix is often a process restart (or sending a suitable signal) to whichever process has the handle…
```
$ lsof | grep deleted
$
```
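You can reproduce the deleted-but-still-open situation safely without root (a sketch; `demo.log` is an arbitrary name, and the `/proc` check makes this Linux-specific):

```shell
# Open a file on descriptor 3, write to it, then unlink it.
exec 3> demo.log
echo "some data" >&3
rm demo.log

# The blocks stay allocated while the fd is open; the kernel marks
# the entry "(deleted)", which is what lsof reports too.
ls -l /proc/$$/fd | grep deleted

# Closing the descriptor finally releases the space.
exec 3>&-
```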
However that is not our issue this time.
Where else can disk space go?
Clutching at straws a little, I wondered if the disk was very fragmented, although losing multiple GBs of space to fragmentation seemed a big stretch. I also don't think I have ever had to defrag an XFS file system. But a bad plan is better than no plan, and indeed that proved true here, as it pointed me in the right direction.
`xfs_fsr` is the tool required. After a couple of minutes of running, it threw an error about a file too large to move. The error listed the large file's inode number and size. The size was around 20GB. Suspicious…
`xfs_fsr` works by copying files to new (disk) locations, so this error makes sense when our total disk is only 30GB. It also tells us that we really do have a very large file somewhere.
Trying to go from the inode to the file path proved unsuccessful:
```
$ find / -xdev -inum 17313616
$
```
Accidentally Hiding Data
To recap so far:
- We don’t have any deleted files still being used.
- We do appear to have an inaccessible large file with an inode we can’t convert to a file path.
Searching online for some pointers, I came across a forum post about disk mounts that can obscure other file locations. That's an interesting idea.
This server does have two disks in use: one standard disk mounted at `/`, and one temp disk (runtime only, data lost on reboot) mounted at `/mnt/resource`.
I was able to prove this was the issue by remounting the primary disk at a second location with `mount`'s bind option. I tried this with `mount --bind / /tmp/tmpmount`, so any path `/foo` was also available as `/tmp/tmpmount/foo`. Crucially, a plain bind mount of `/` exposes the root file system without the other mounts layered on top of it.
`du -chs /tmp/tmpmount` then gave me 28GB, not 8GB, and sure enough it is because `/tmp/tmpmount/mnt/resource` existed with a big 20GB temporary log file inside.
I could fix it by deleting the file from `/tmp/tmpmount/mnt/resource`:
```
$ df -h
Filesystem      Size   Used  Avail Use% Mounted on
devtmpfs        1.7G      0   1.7G   0% /dev
tmpfs           1.7G      0   1.7G   0% /dev/shm
tmpfs           1.7G   968K   1.7G   1% /run
tmpfs           1.7G      0   1.7G   0% /sys/fs/cgroup
/dev/sda2        30G   4.7G    25G  16% /              <-- more like it!
/dev/sda1       497M    64M   433M  13% /boot
/dev/sdb1        50G    17G    30G  36% /mnt/resource
tmpfs           345M      0   345M   0% /run/user/1002
```
How did this happen?
Here is the theory:
- The virtual machine starts up/reboots.
- `rsyslogd` started listening for data and streaming it out to a log file under `/mnt/resource`.
- It did this before the temp disk had been mounted, so the file was actually created on the root disk.
- Once the mount took place the original file was hidden.
- `du` works by walking all visible files and summing their sizes, so it won't see a file hidden beneath a mount point.
- The file then continued to grow via the still-open handle, while each hour logrotate ran and could not see any files to operate on.
- A restart of `rsyslogd` would have half resolved the issue: files would then start appearing in the right place, but the existing large file would be left hidden and stuck there forever.
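The hiding step is easy to reproduce in a scratch VM as root (`/mnt/demo` is an arbitrary path; any mount, tmpfs here, will do):

```
$ mkdir -p /mnt/demo
$ echo "written before the mount" > /mnt/demo/app.log
$ mount -t tmpfs tmpfs /mnt/demo   # mount anything over the directory
$ ls /mnt/demo                     # app.log has vanished from view
$ umount /mnt/demo
$ ls /mnt/demo                     # app.log is back, untouched
```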
Fixing The Bad Mount
systemd has support for saying ‘only start this service after X condition is
met’. Often we think of that as another service, but you can include mount
points as well (run
systemctl list-units --type=mount to see what’s
available). If this is news to you then check out Digital Ocean’s
systemd page which is a helpful condensed guide. We also don’t need to
override the system installed
rsyslog unit file, which might get lost on
upgrade. We can instead place a drop-in file in a new sub-directory. Our fix below assumes the core unit file is at `/usr/lib/systemd/system/rsyslog.service`.
```
$ cat /usr/lib/systemd/system/rsyslog.service.d/wait-for-mount.conf
# Ensure our temp disk for log storage is ready to go before starting
[Unit]
After=mnt-resource.mount
```
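After placing the drop-in, tell systemd to pick it up and confirm the dependency registered:

```
$ systemctl daemon-reload
$ systemctl restart rsyslog
$ systemctl show rsyslog -p After   # mnt-resource.mount should now appear in the list
```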
Mystery solved and fixed for the future!
Postscript - Is this really fixed?
Coming back to this draft ahead of hitting publish, it turns out the fix might not be as straightforward in all cases. Let me explain a little…
The Azure temporary disk mount appears to be a little different to normal disk mounts. `waagent.service` (the Azure guest agent) is responsible for performing the mount (rather than an entry in `/etc/fstab`, which would simply fail in this case). However, setting `After=waagent.service` in your unit config doesn't resolve the issue, as systemd considers the `waagent` service being "running" to be the same as being "ready".
I had wondered about using systemd auto-restart on `rsyslogd` together with a condition line based on the mount directory, but if the condition isn't met systemd won't then try to load the unit again (it's different to having a process fail and get auto-restarted).
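For reference, the kind of condition line I mean is systemd's `ConditionPathIsMountPoint=`. When a condition fails, the unit is silently skipped rather than retried, which is exactly the behaviour described above:

```
# Hypothetical drop-in: only start if the path is already a mount point.
[Unit]
ConditionPathIsMountPoint=/mnt/resource
```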
Some online suggestions are along the lines of having a script that checks for the mount on a timer and then kick-starts the required service. I decided to go for a simpler hack of a solution and just added a 30 second startup delay (`ExecStartPre=/bin/sleep 30`) to the rsyslog extra unit file. It does the job, and considering the wider processing happening here is temporary anyway, it doesn't seem worth creating anything more elaborate.
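For completeness, here is a sketch of what the combined drop-in could look like, using `ExecStartPre=/bin/sleep` as the delay mechanism (the 30 seconds is an arbitrary grace period):

```
$ cat /usr/lib/systemd/system/rsyslog.service.d/wait-for-mount.conf
# Ensure the temp disk for log storage is ready before starting,
# and give the Azure agent a grace period on boot
[Unit]
After=mnt-resource.mount

[Service]
ExecStartPre=/bin/sleep 30
```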