# 'du' behavior



## ASX (Jun 28, 2016)

hello,

I'm seeing some questionable output from du(1).

The script below shows the effect, at least on my system:

```
rm -rf test
mkdir test
cp /boot/kernel/kernel test/kernel1
cp /boot/kernel/kernel test/kernel2
cp /boot/kernel/kernel test/kernel3
cp /boot/kernel/kernel test/kernel4
cp /boot/kernel/kernel test/kernel5
cp /boot/kernel/kernel test/kernel6
cp /boot/kernel/kernel test/kernel7
cp /boot/kernel/kernel test/kernel8
ls -l test
du -s test/*
```

and the `du` output is not what I expected:


```
total 64876
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel1
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel2
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel3
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel4
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel5
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel6
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel7
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel8
21153   test/kernel1
21153   test/kernel2
21153   test/kernel3
1417   test/kernel4
1   test/kernel5
1   test/kernel6
1   test/kernel7
1   test/kernel8
```

The test was run on a ZFS filesystem, if that matters.

Any comments? Bug or feature?


----------



## SirDice (Jun 28, 2016)

First guess would be the filesystem being full during the copy of kernel4.


----------



## kpa (Jun 28, 2016)

This is on ZFS so it's very unlikely that the / dataset got full unless there are quotas in place.


----------



## ShelLuser (Jun 28, 2016)

This script might be easier:


```
#!/bin/sh

mkdir testing

for a in `seq 1 8`; do
        cp /boot/kernel/kernel testing/kernel$a;
done;

ls -l testing
du -s testing/*
```
I don't think this is a bug but a timing issue. If you issue the `du -s $dir/*` command after the script has finished, you'll get the normal output you'd expect. Therefore I can only conclude that while the script is running some data is still being processed (probably cache related), which would explain the difference.
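The timing hypothesis is easy to check from the shell. A minimal sketch (the copy source and file sizes are just examples; the effect only shows on ZFS):

```
#!/bin/sh
# Sketch: read du(1) immediately after a copy, then again after a
# pause.  On ZFS the first reading can be far too small because the
# blocks are not yet committed to disk; on UFS both readings match.
mkdir -p testing
cp /boot/kernel/kernel testing/kernel1 2>/dev/null ||
    dd if=/dev/zero of=testing/kernel1 bs=1024 count=2048 2>/dev/null

du -sk testing/kernel1    # may be too small on ZFS
sleep 5
du -sk testing/kernel1    # should now show the real block usage
```

On a pool with default settings the two readings bracket the transaction-group commit; on UFS they should be identical.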


----------



## ASX (Jun 28, 2016)

No, the filesystem is 11% used, 400 GB free, and there are no quotas.

And yes, it seems the result depends on the fact that the write operations are queued and the blocks aren't effectively written at the time the du command is run.

Waiting a couple of seconds and repeating the `du -s test/*` command leads to the expected result.
That's why I wrote 'bug or feature ?'.

But honestly, to me it looks more like a bug.


----------



## ShelLuser (Jun 28, 2016)

In the meantime I can confirm that on my systems this behavior happens on ZFS but not on UFS.


----------



## kpa (Jun 28, 2016)

Is there compression involved? As far as I know, ZFS writes in a "lazy" fashion when compression is used, and it can take a few seconds until the file properties reported by the stat(2) syscall match what is actually on disk.


----------



## ASX (Jun 28, 2016)

kpa, no compression involved.


----------



## kpa (Jun 28, 2016)

I haven't used ZFS in a while but this could be a feature of it that you just have to live with. ZFS pushes many boundaries already and exposes problems in the UNIX APIs; this could be one of them. One famous example is running df(1) on a system that uses ZFS: the numbers you get for free space don't add up the way you would expect, because the free space reported for each dataset is actually the free space of the whole pool.


----------



## ShelLuser (Jun 28, 2016)

kpa said:


> One famous example is running df(1) on a system that uses ZFS, the numbers you get for free space do not match in any you would expect because the free space reported for each dataset is actually the free space of the whole pool.


Not always true:


```
breve:/home/peter $ df -lh |grep zroot
zroot                            8.4G    2.8G    5.7G    33%    /
zroot/home                       6.9G    1.2G    5.7G    18%    /home
zroot/src                        6.6G    903M    5.7G    13%    /usr/src
zroot/var                        6.9G     30M    6.9G     0%    /var
zroot/var/db                     7.0G    151M    6.9G     2%    /var/db
```

Here you can see that df fully honors my reservation setting for the zroot/var filesystem:


```
breve:/home/peter $ zfs get reservation zroot/var
NAME       PROPERTY     VALUE   SOURCE
zroot/var  reservation  2.44G   received
```

As such the behavior of df is fully consistent here, in my opinion.


----------



## kpa (Jun 28, 2016)

Yes but you have to set those reservations (or quotas for that matter) yourself, I was referring to the default behavior.


----------



## ASX (Jun 28, 2016)

Of course, I can live with any behavior once I know about it.

And while the df(1) output is logical in the ZFS context, I cannot say the same for du(1).

Regarding the stat(2) syscall: note that the ls(1) output is immediately and always correct, so I guess that du(1) doesn't use the stat() syscall and performs a different calculation.
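For what it's worth, both tools go through stat(2); they just read different fields: `ls -l` prints st_size, the byte count recorded in the inode and updated immediately, while `du` counts st_blocks, the blocks actually allocated, which ZFS only settles when the transaction group commits. A quick portable sketch of the two views (the file name and size are illustrative):

```
#!/bin/sh
# Sketch: st_size (what ls -l shows) vs. allocated blocks (what du
# counts).  The file name and size are just examples.
dd if=/dev/zero of=sample.bin bs=1024 count=512 2>/dev/null

ls -l sample.bin | awk '{print "st_size bytes :", $5}'
du -k sample.bin | awk '{print "allocated KB  :", $1}'
```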


----------



## ASX (Jun 28, 2016)

I filed a bug, PR 210671, let's see if it is a "feature"!


----------



## phoenix (Jun 29, 2016)

It's a feature.

ZFS writes data out in chunks, known as *transaction groups*.  By default, this writing happens every 5 seconds.  Thus, depending on how much data is being written out, you may need to wait 6 seconds in order to make sure everything has been written to the pool.

The default used to be 30 seconds.
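On FreeBSD this commit interval is exposed as the `vfs.zfs.txg.timeout` sysctl; a sketch of how it could be inspected, or persisted with a different value (the 5-second figure is the default on recent releases; verify the tunable name against your release):

```
# Inspect the current commit interval (in seconds) at runtime:
#   sysctl vfs.zfs.txg.timeout
#
# Persist a different value via /etc/sysctl.conf:
vfs.zfs.txg.timeout=5
```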


----------



## ASX (Jun 30, 2016)

The fact that writes are delayed is true for practically every filesystem out there, but that, in my opinion, should not affect the reporting of 'used blocks'.

As an example in the same domain: an application can read a block from a newly written file, and whether the block is already on disk or not is irrelevant. Or should I suppose this is also not true for ZFS?


----------



## ASX (Jun 30, 2016)

I made an additional test using `du -A test/*`; the `-A` flag is intended to provide the apparent size of a file, mainly useful when examining sparse files.

In the case discussed here the result should be the same with or without the `-A` flag, and it is NOT.

At the very least, I can now state that `du` output is inconsistent with `du -A` output.

NB: I'm deliberately ignoring the `du` implementation details.


----------



## kpa (Jun 30, 2016)

I believe the word "transaction" is the important one here. If ZFS wrote the file metadata (the directory entry etc.) first and completely, so that the output of `du` immediately matched the size the file will have once it is completely on disk, you would end up with a serious inconsistency if the system crashed before the actual writes to the file had finished. The directory entry would claim that the file is complete, but parts of it would be missing, and I don't think ZFS (nor the standard UNIX file handling APIs) has any provision to record the fact that parts of a file are missing because of such an incident. So in summary, ZFS writes only in a way that is atomic: transactions either succeed completely or are rolled back completely in case of failure.


----------



## ASX (Jun 30, 2016)

I can agree that the transaction management in ZFS could be a possible source of the observed du behavior, but I do not agree that "it is a feature".


----------



## ANOKNUSA (Jun 30, 2016)

ASX said:


> The fact that the "writings" are delayed is true practically for every filesystem out there ... I do not agree about "it is a feature".



You can call it a "side-effect," then. It's the expected result of a deliberate design choice. Your script removes a directory; then immediately recreates that directory; then immediately fills that directory; then asks for the space consumed by files in that directory. The shell script is executing these things before ZFS can report accurate information to the kernel, because its copy-on-write and transactional nature dictate that it does not immediately destroy and overwrite data. So it needs time to rewrite pointers, update its uberblocks and ditto blocks, eventually remove the old directory (the one you ordered deleted, which wasn't actually deleted yet), update its own information (such as that reported by `zfs list`), and _then_ report its true capacity and the true size of the files. If you're executing du the instant a copy operation completes, you're not going to get accurate information.

This is not the journaling or write caching many filesystems do. ZFS is fundamentally different from every other filesystem out there. Expect quirks. Learn how it works.


----------



## ASX (Jun 30, 2016)

ANOKNUSA said:


> Learn how it works.



Yes, of course, but how would that fix a program, `du`, that at certain times returns an incorrect value and a few seconds later a correct one?

Or are you telling me that I have to insert a `sleep(10)` or something like that before using `du`? Is that what you mean?

Something is not quite right.



ANOKNUSA said:


> You can call it a "side-effect



Fine. At the very least I believe it should be documented, and possibly fixed; if a fix is not possible, it would be better to make it refuse to work on ZFS. That's my view.

What I have learned so far is that I cannot rely on `du`, and therefore cannot use it anymore. Fine too, I will live with that.


----------



## tomxor (Jun 30, 2016)

Are you guys still sure this behaviour is due to ZFS "transactions"?

My first thought was ZFS deduplication, as these are duplicate files and du shows actual block usage rather than file size. I would assume the selective appearance at the time du is called would be due to the process being deferred to not hinder write performance.

You could test the theory by copying a set of more unique files instead.


----------



## ShelLuser (Jun 30, 2016)

ASX said:


> yes, of course, but how that would fix a program `du` that a certain times return an incorrect value and a few second later a correct value?


You're leaving out a very important detail here: _on ZFS_. I tried it myself and it only occurs on ZFS, not UFS. Which, as kpa above me also mentioned, isn't all that surprising, because they don't call ZFS "a next-gen filesystem" without reason.

Question being: is there really something to fix here?

I agree that it would help if this caveat got mentioned in the du manual page, but other than that I don't think there's much to fix here. And if I play the devil's advocate I might even be able to argue that there isn't anything to fix:

du manual page said:

> The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument.


If the file system isn't using any blocks at the time it is run, then how would du be able to fix this in the first place? How would that make `du` broken, as you hinted?

It's not du, it's the underlying filesystem. And the manual page covers this, as shown above.



tomxor said:


> My first thought was ZFS deduplication, as these are duplicate files and du shows actual block usage rather than file size.


I tested this myself on two ZFS-powered systems, one without mirroring and one with; both showed the same behaviour. So this isn't a deduplication issue but something ZFS-specific.


----------



## ASX (Jun 30, 2016)

ShelLuser said:


> I agree that it would help if this caveat got mentioned in the du manualpage, but other than that I don't think there's much to fix here.


Thanks, that was what I was hoping to read. (And I thought that "on ZFS" was already implied.)

I can now add that I have performed a similar test using OpenIndiana, and the result is similar; it is definitely ZFS-related.

I will also explain how I ran into the problem: I was comparing mkuzip and mkulzma outputs/sizes:

```
mkulzma testfile
mkuzip testfile
du testfile*
```

Then I noticed that testfile.uzip was smaller than testfile.ulzma, which wasn't expected.
Subsequent steps allowed for detection of the erratic `du` behavior.

If I had inverted the two compression commands, I would not have noticed the problem, and most likely I would have come to the forum claiming a 100x improvement of lzma compression over zip compression.

Seriously, what may be worrying is that the size error goes unnoticed.

Think of using `du -s` on a just-created tree to allocate a specific md(4) size to copy the tree into: you will find yourself in trouble without any real error in your code, and additionally with an issue that is time/load dependent.
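That md(4) scenario can be sketched like this: flush outstanding writes and wait out the transaction-group commit before trusting `du -s` for a size calculation (the tree, sizes, and mdconfig invocation below are purely illustrative):

```
#!/bin/sh
# Sketch: size a just-created tree before allocating an md(4) disk
# for it.  Without the sync/sleep, du -s on ZFS can report far fewer
# blocks than the tree really occupies.
mkdir -p tree
dd if=/dev/zero of=tree/payload bs=1024 count=1024 2>/dev/null

sync        # push outstanding writes
sleep 5     # on ZFS, wait out the transaction-group commit

kb=$(du -sk tree | awk '{print $1}')
echo "allocate at least ${kb} KB"
# e.g. on FreeBSD (illustrative): mdconfig -a -t swap -s "${kb}k"
```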


----------



## tomxor (Jun 30, 2016)

ShelLuser said:


> I tested this myself on 2 ZFS powered systems: one without mirroring and one with, both showed the same behaviour. So this isn't a deduplication but a specific ZFS based issue.



I thought deduplication was separate from mirroring... it's basically compression at the logical-volume level, which is separate from mirroring the entire volume.

I'm asking because I'm new to ZFS, but from what I've read, dedup on ZFS sounds like a separate option; I can't find what the default is for FreeBSD.


----------



## kpa (Jun 30, 2016)

Deduplication has to be explicitly turned on if you want to use it, and it has nothing to do with mirroring.


----------



## ANOKNUSA (Jun 30, 2016)

ASX said:


> Or are you telling me that I have to insert a  sleep(10) or something like that before to use  du ? Is that what you mean ?



Yes. One second was actually enough when I tested it. du is simply being executed after the ZFS userland process finishes, but before ZFS is able to complete its background tasks.


----------



## ASX (Jul 1, 2016)

ANOKNUSA said:


> Yes. One second was actually enough when I tested it.



The problem I see with this approach is determining *how much time ZFS requires to complete its background tasks*.

Therefore an arbitrary sleep might not be enough in certain situations; in short, a very weak workaround.

Using `du -A` would be an alternative if no sparse files are involved.
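Since the required delay can't be known in advance, one workaround (a sketch, not a guarantee; on an otherwise active directory it may never settle) is to poll until two consecutive `du` readings agree instead of sleeping a fixed time:

```
#!/bin/sh
# Sketch: repeat du -s until two consecutive readings match, rather
# than guessing a fixed sleep.  DIR defaults to the current directory.
DIR=${1:-.}
prev=-1
while :; do
    cur=$(du -sk "$DIR" | awk '{print $1}')
    if [ "$cur" = "$prev" ]; then
        break
    fi
    prev=$cur
    sleep 1
done
echo "stable size: ${cur} KB"
```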


----------



## ASX (Jul 1, 2016)

I'm adding this here, though maybe I should open a new PR about ls(1):

`ls -s` shows the same behavior as `du`, i.e. it may report an incorrect number of used blocks.

The nice thing is that both commands work with some other flags: `ls -l` always provides the correct file size, and so does `du -A` (apart from the sparse-file case).
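The split between the flags follows from the inode fields each one reads; a small sketch putting them side by side (the file name is just an example, and the block-count column's unit varies by system):

```
#!/bin/sh
# Sketch: the flags that stay correct vs. the ones that can lag on
# ZFS.  ls -l (and du -A) read the recorded file size; ls -s and
# plain du read the allocated block count.
dd if=/dev/zero of=demo.bin bs=1024 count=256 2>/dev/null

ls -l demo.bin     # size column: always correct
ls -s demo.bin     # block count: may lag on ZFS
du -k demo.bin     # same block count, in KB
```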


----------



## phoenix (Jul 3, 2016)

This is why, if you search ZFS-related mailing lists, they all recommend ignoring du and df when using ZFS, as they won't give you the results you expect.

Instead, use `zfs list`. That will show you the correct information.

If you search the zfs-discuss mailing list archives, there are several long posts that describe how du, df, zfs list, and zpool list interact and why they show different values. I've also made similar postings to the freebsd-fs list years ago.

Basically, the fs-related tools of the past don't work on next-gen filesystems like ZFS and Btrfs. Stop relying on them, and use the filesystem-specific tools.


----------



## ASX (Jul 3, 2016)

I really appreciate the discussion, even if I need to disagree.



phoenix said:


> Basically, the fs-related tools of the past don't work on the next-gen fs like zfs and btrfs. Stop relying on them, and use the fs-related specific tools.



What happens then to shell script compatibility? Asserting something like that is effectively saying to throw away the old scripts and rewrite everything from scratch.

And the worst case is when you may not notice the problem at all.

What I find surprising is that on one side `du` is considered "unreliable", as you suggest, and on the other side it is difficult to get this recognized as a bug.



phoenix said:


> Instead, use  zfs list. That will show you the correct information.



If the above is true, I see no reason why the underlying code (the zfs command) could not be integrated into `du` to get the correct result.


----------



## kpa (Jul 3, 2016)

It would be pointless to change du(1) to recognize different filesystems; that's the wrong way to fix the problem. The du(1) command is really nothing but a simple frontend that trusts that the stat(2) etc. system calls return correct information, and it should be left as it is. Where the "problem" should be fixed is the VFS layer, which works to present all the different filesystems in a uniform way; that's the source of this problem, and if there is a workable solution at all it should be done at the VFS layer.


----------



## ASX (Jul 3, 2016)

kpa, I agree. I wrote "if the above is true".


----------



## ASX (Jul 3, 2016)

To clarify and avoid misunderstandings:

du may be formally correct, in that it shows exactly the used blocks at a specific point in time, but the fact is that for any practical purpose this behavior is meaningless.

The use of du has always been about looking for "used blocks", it being irrelevant whether the blocks have already been written out or not by the underlying I/O subsystem.


----------

