# ZFS + cache taking too much RAM



## chrcol (Jan 30, 2011)

I have been trying to fix a problem on a server for a while, trying various ZFS tunables with varying success.

Two notable problems: MyISAM MySQL performance is weak (mostly fixed by disabling ZFS prefetch), and the system collapses during the backups which run nightly.  These backups involve tarballing entire home directories and databases.  I don't want this thread to be a debate on what's more efficient for backups, thanks.

Disabling prefetch damages other file access, especially the backups, so tuning for one workload damages another; I think it is a big oversight to make prefetch a system-wide-only variable.

The server is FreeBSD 8.1 with ZFS v15, amd64, 8 GB of RAM (another 4 GB on the way).

Last night I picked up on what's going wrong, since every morning there is excess swap being utilised.  I started the backups manually and graphed the memory usage.  When the backups started, the ZFS ARC was slowly shrinking and 'Inact' usage was growing (cacti reports this as cache); this carried on until the server choked.  It choked at around 2.8 GB of Inact usage (grown from just 700 MB), ZFS at that point had dropped from 3 GB to 1.5 GB, and there was still no swap usage.  Then there was a sudden spike in processes and too much strain on RAM (cache not releasing fast enough), so loads of swap was used and the server choked, going offline for a while until it sorted its RAM usage out.  When cacti was able to connect again, the cache had been flushed and RAM usage was very low (4 GB free, unused), so obviously at that point the Inact had been dumped to disk, and the server slowly went back to normal.

From observation, it would seem the following is happening:

1 - Memory management is not handled as well automatically as some say, e.g. here: http://forums.freebsd.org/showthread.php?t=9980.  I wonder if there is a way to cap Inact usage the way the ZFS ARC can be capped.
2 - Does ZFS have two caches? The ARC for reads and normal Inact for writes?
3 - Suggestions on how to resolve this are welcome; I will provide whatever info is asked for.  As I said, please don't remind me that tarballing is inefficient; I know, but it is how the installed control panel software manages backups.

The same system on UFS did lag during backups, but I am talking about a few seconds of lag, not the system collapsing in on itself.  ZFS is in a 2-drive mirror setup; I was going to add 2 new drives to make a dual mirror on the root tank, but it seems that isn't supported on a root ZFS pool, and likewise a log caching device isn't supported either.
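For reference, the tunables being juggled here are set in /boot/loader.conf; a minimal sketch (the values shown are illustrative, not recommendations):

```
# /boot/loader.conf -- illustrative values only
vfs.zfs.prefetch_disable="1"   # helps MyISAM/MySQL here, hurts the backups
vfs.zfs.arc_max="3G"           # cap the ARC so other consumers keep headroom
```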


----------



## da1 (Jan 30, 2011)

Hi,

1) I think this is a ZFS issue, because "The ARC grows and consumes memory on the principle that no need exists to return data to the system while there is still plenty of free memory. When the ARC has grown and outside memory pressure exists, for example, *when a new application starts up, then the ARC releases its hold on memory*." This is the problem from my point of view, because the darn thing just doesn't work like that. I for one use
```
perl -e '$x = "x" x 1000000000;'
```
in a sh loop script, every 45 seconds, to clear the Inact mem.
2) Only one, as far as I can tell from my short experience, and that is the ARC, which is stored in kernel space ("ZFS caches data in kernel addressable memory"). It uses Inact to "move" data from one place to another. For instance, when you do a mv, cp, dump, or tar operation, the data goes to Inact (and stays there until the RAM is full and other apps need it) and then on to the destination.
3) If you find a solution, I wanna know it too.
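That loop from (1) might look roughly like this (the script shape and sizing parameter are my own; only the perl one-liner is from the post above):

```shell
#!/bin/sh
# Crude Inact-flusher (sketch). Allocating a large string in a
# short-lived perl process forces the page daemon to reclaim
# inactive pages; the memory is freed again when perl exits.

pressure() {
    # $1 = number of bytes to allocate (1000000000 = ~1 GB)
    perl -e "\$x = 'x' x $1;"
}

# The usage described above, roughly: repeat every 45 seconds.
# while :; do pressure 1000000000; sleep 45; done
```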

How did you go about getting ZFS v15 on FreeBSD 8.1?

Good reading:
ZFS Evil Tuning Guide
ZFS Best Practices
Some similar issues discussed here on the forum


----------



## chrcol (Jan 30, 2011)

Many thanks, da1.

I did find this, which looks remarkably similar.

http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/146410

It does seem data is being cached twice; Inact, according to the docs, is cached data that's not yet flushed, so basically a write cache.  I have always considered ZFS to have its own cache, and on every other ZFS server I have access to, the inactive cache does not grow; it stays the same size.

From what I see, the ARC goes into wired, not Inact, memory.  On my graphs the ARC was shrinking whilst Inact was growing, so they were fighting each other for RAM, and Inact was winning.

I ran that command and it seemed to have little effect:

Mem: 1295M Active, 974M Inact, 3497M Wired, 3588K Cache, 2119M Free
Swap: 8192M Total, 8192M Free

That's the top output afterwards, and it was about the same before.  The ZFS ARC is 3 GB, which as you can see fits within the wired usage.

For ZFS v15 I used the patch posted on the mailing lists in December; however, that patch is buggy, so don't use it.  I won't post it, for safety.  This server will be updated to 8.2 after release, so it has the ZFS v15 code cleanly integrated.

OK, so I have reread that thread; it would appear this high Inact is supposed to be for the UFS cache? That makes my situation bizarre, as it's a ZFS-root server with no UFS drives.  I ran the perl command again with a higher number, 3 GB instead of 1 GB, and it did do something: it reduced my wired (ZFS ARC), not the inactive.  So to me the ZFS ARC seems to shrink OK, but the Inact does not.


----------



## da1 (Jan 30, 2011)

Hmm ... now that's weird. 
On a UFS/ZFS setup the bug makes its presence well known, but on a full ZFS layout I have no problems.


----------



## chrcol (Jan 30, 2011)

Yeah, it's definitely odd; all day long the inactive is fairly static, moving up and down a little but not much, but when the tar runs it shoots through the roof.

Also, there is a gap in the cacti graphs from when I ran the perl command; there was no swapping, but it seems it caused downtime on the server.


----------



## da1 (Jan 30, 2011)

How much data are you tar'ing?
What's vm.kmem_size_max's value in sysctl? Are you setting it manually or not? I have it set to 512M, the wired memory does not exceed 670 MB, and I just tried tar'ing a 700 MB movie.


----------



## chrcol (Jan 30, 2011)

I think this has turned out to be a hardware problem.

The server just almost hung: lots of processes stuck in the zfs state and not killable, and the reboot failed in the shutdown script.

Forced the reboot, and then bootup was very, very slow.

Disabled the 2nd hdd in the BIOS; bootup was still very slow.

Disabled the 1st hdd in the BIOS and re-enabled the 2nd hdd, and was met with a "cannot locate zfs pool" message; it seems I hit this bug, ouch.

http://www.freebsd.org/cgi/query-pr.cgi?pr=148655

As the code is 8.1-RELEASE, it fits in with that timeline (it is not the December 2010 code as earlier stated).

The rescue disc in the server is ZFS v14, so for now I cannot do much else, I think.  The boot is normal up until it mounts root and starts the network, but once it starts starting services it is extremely slow; even after 10 minutes it is not done, and sshd starts but times out.

For the tuning, kmem was set to 12 GB with the ARC capped to 3 GB.
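Spelled out, those settings correspond to loader.conf lines like these (my reconstruction; only the 12 GB kmem and 3 GB ARC values come from this post, and the kmem_size_max line is an assumption):

```
# /boot/loader.conf
vm.kmem_size="12G"
vm.kmem_size_max="12G"
vfs.zfs.arc_max="3G"
```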


----------



## da1 (Jan 31, 2011)

I can see you're having the time of your life with this server. :)

The bug can be patched, just search the forums for it.


----------



## chrcol (Jan 31, 2011)

It was a too-low maxvnodes.  I did what others suggested in that other thread to try to reduce the Inact cache.  It seems it took some hours after I left it to take effect, and then it was breaking every boot; I first noticed it because booting in single-user mode was fine.  Now that it's increased, it boots fine again; although I still have a low maxvnodes, it is not as low as was suggested in that thread.
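For anyone following along: maxvnodes here is the kern.maxvnodes sysctl, settable live or in /etc/sysctl.conf (the value below is illustrative, not the one from that thread):

```
# /etc/sysctl.conf -- illustrative value; setting this too low broke boot, as above
kern.maxvnodes=100000
```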

I learned about that horrible bug though, where it cannot boot when a hdd is removed.  So we are updating this to 8.2 now to resolve that.  In addition, it seems when we add the UFS drive for backups we are going to get this UFS cache bug for that as well? Or is that fixed in 8.2?


----------



## da1 (Jan 31, 2011)

chrcol said:

> I learned about that horrible bug though, where it cannot boot when a hdd is removed.


You mean the "cannot boot from 2nd disk" bug? There is a patch for it, and I'm using it on an amd64 8.1-RELEASE.



> So we are updating this to 8.2 now to resolve that.


If the bug is the aforementioned one, the upgrade is not needed if you apply the patch.



> In addition, it seems when we add the UFS drive for backups we are going to get this UFS cache bug for that as well? Or is that fixed in 8.2?


I know the "cannot boot from 2nd disk" bug is fixed in 8.2, but I have no idea about the UFS/ZFS one.


----------



## chrcol (Jan 31, 2011)

You oversell the patch.

I tried the patch and it doesn't compile; it complained about a missing file. http://forums.freebsd.org/showpost.php?p=121526&postcount=25
One other person in that thread also couldn't compile it. http://forums.freebsd.org/showpost.php?p=114334&postcount=17
A second person mentioned it needs a dependency compiled first (without instructions). http://forums.freebsd.org/showpost.php?p=105725&postcount=14

To me, a patch that doesn't compile when following the listed instructions becomes too risky, and as such it is safer to upgrade to a code base that has the patch safely integrated.

Also, the fact that the server is using a ZFS v15 patch that has since been discovered to be broken further justifies upgrading the code base.

On the UFS/ZFS one, I will manually check the src file to see if the patch changes exist, and will get back to you on this.

OK, the ZFS/UFS patch isn't in 8.2 (disappointed), so 8.2 has been shipped with a known bad bug.

Incidentally, do you know how to get patches cleanly off the PR page? They all have extra characters added in the code.


----------



## da1 (Jan 31, 2011)

Here are my own instructions:


```
# Patch for 8.1-RELEASE (use "gpart show" for the disks) - http://forums.freebsd.org/showthread.php?t=16535
# http://people.freebsd.org/~mm/patches/zfs/head-zfsimpl.c.patch <- patch for the 8.1 ZFS bootloader problem

cd /usr/src
fetch http://people.freebsd.org/~mm/patches/zfs/head-zfsimpl.c.patch
patch -p0 < head-zfsimpl.c.patch
make -j4 buildworld
cd /usr/src/sys/boot/i386/zfsloader
make install
cd /usr/src/sys/boot/i386/gptzfsboot
make install
gpart bootcode -p /boot/gptzfsboot -i 1 ad4
gpart bootcode -p /boot/gptzfsboot -i 1 ad6
```

You need to build the world, or the patch will not build (you do not need to install the world).

For your setup, however, I'm a bit reluctant to speculate whether it's gonna work or not (you have v15).


----------



## chrcol (Jan 31, 2011)

It's fixed now (the boot issue).

I'm on 8.2-RC3 on that server now and will see how the behaviour is with the tarballs; I checked the src code and a fair few patches have made it in there.  But not the UFS memory issue patch, which incidentally looks quite old now, prepped for 8.0-STABLE, so I wouldn't feel confident using it on 8.2 code.


----------



## chrcol (Feb 8, 2011)

OK, an update.

Since changing to 8.2, I have not seen any ZFS throttles, which is good, and the server hanging from high I/O is almost gone.

However, Inact remains uncontrollable.

With just ZFS in use there is a large jump in memory demand when the tarballing finishes; during the tarball it is OK, but when it ends it seems to want to dump all the data into RAM or something, as RAM usage jumps up in Inact and the ZFS cache is compromised.

Yesterday we tried doing a backup to a UFS disk and came across the bug you linked to: I had 6 GB of Inact and 1 GB of buffer.  This would not go down until I killed the scp process (which was sending the tarballs to a remote server).

So the ZFS cache behaves itself; the UFS cache is rampant.


```
last pid:  3277;  load averages:  1.13,  1.22,  1.25    up 3+13:15:07  15:43:11
269 processes: 2 running, 266 sleeping, 1 zombie
CPU:  4.7% user,  0.0% nice,  6.2% system,  0.4% interrupt, 88.7% idle
Mem: 1323M Active, 6091M Inact, 3984M Wired, 341M Cache, 1236M Buf, 126M Free
Swap: 8192M Total, 8192M Free
```

That's after I flushed the swap; there was about 240 MB of data in swap.  Wired should be 4.5 GB in normal conditions, with about 3-4 GB of free RAM.

What I want is either:

1 - the ZFS cache to take priority over the UFS cache, so that if there is not enough room for both, UFS slows down and loses its cache, etc.
2 - a hard cap on the UFS cache, or even a hard cap on Inact use.

In addition, cache should never, ever make data swap out; cache should always be dropped before swapping.


----------



## chrcol (Feb 9, 2011)

I set a minimum ARC size in loader.conf to partially resolve this, so now UFS doesn't grow so big, as ZFS holds its own; however, UFS caching in Inact still causes some swap.
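The tunable in question is vfs.zfs.arc_min, set in /boot/loader.conf, e.g.:

```
# /boot/loader.conf
vfs.zfs.arc_min="3G"   # keep the ARC from shrinking below this
```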


----------



## chrcol (Feb 10, 2011)

I am guessing I should file a new PR.  It seems having 3.5 GB of free RAM isn't enough for UFS; it is using 1.2 GB of swap right now to create tarballs (on top of the 3.5 GB of free RAM).


----------



## chrcol (Feb 16, 2011)

It has taken a new twist today.  UFS has put so much demand on RAM that ZFS has reduced its ARC to below the "vfs.zfs.arc_min" setting. It's set to 3 GB, and currently I have the following:

ZFS ARC: 2.12 GB RAM usage (below the setting)
UFS cache: 9.65 GB RAM usage (a drive only used for backups at night is using 4/5 of physical RAM)
470 MB of swap
12 GB of total RAM

I see posts on mailing lists with similar concerns, with just one reply from a dev stating he is unaware of UFS being greedy.

Deleting the files off the UFS drive frees the RAM immediately.


----------



## sub_mesa (Feb 22, 2011)

chrcol said:

> What I want is either:
> 
> 1 - the ZFS cache to take priority over the UFS cache, so that if there is not enough room for both, UFS slows down and loses its cache, etc.
> 2 - a hard cap on the UFS cache, or even a hard cap on Inact use.
> ...


You described very accurately what I believe should be done. Even better would be run-time configurable limits via sysctl. 

The workaround for now would be a 100% ZFS system with Root-on-ZFS. Both the mfsBSD and ZFSguru projects have .iso images that can perform a Root-on-ZFS installation.


----------



## chrcol (Feb 24, 2011)

It was originally 100% ZFS, but again it suffered from a lack of configurability.

The backups without prefetch lag the entire system horribly, with I/O pegged at 100%; but with prefetch enabled, normal server load was excessive, as things like MySQL suffer a lot with it on. The solution? Allow prefetch to be toggled per fileset, and ideally without needing a reboot.

What I have done now is cap nbuf to just 64 MB, as ZFS doesn't seem to use it and it had been autotuned to 1.2 GB after I started using the UFS drive; since then the cache has started behaving itself.  In addition, the backups being sent to the remote server were also causing excessive data to be cached.  Deleting the file flushed the cache, but I also found that unmounting the UFS drive flushes it, so I modified my backup script to unmount and remount the UFS drive after every file is read.  It's messy, but it works.  Finally, I disabled soft updates; I have no idea if this has had an effect on cache demand, but performance isn't needed on the UFS drive, so I figured that with it off there is less strain on the memory subsystem.
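A sketch of that unmount/remount trick (the mount point, remote target, and script shape here are hypothetical; only the umount/mount after each file is the actual workaround described above):

```shell
#!/bin/sh
# Backup loop with the unmount/remount workaround (sketch; all paths
# and the remote host are hypothetical).
BACKUP_MNT=${BACKUP_MNT:-/backup}        # the UFS scratch drive
REMOTE=${REMOTE:-backup@remote:/dumps}   # scp destination

backup_one() {
    # $1 = directory to archive
    name=$(basename "$1")
    tar -czf "$BACKUP_MNT/$name.tar.gz" "$1" &&
        scp "$BACKUP_MNT/$name.tar.gz" "$REMOTE"
    # Unmounting drops the vnodes for the files just read and written,
    # which releases the Inact pages UFS was holding on to.
    umount "$BACKUP_MNT" && mount "$BACKUP_MNT"
}

# e.g.: for home in /home/*; do backup_one "$home"; done
```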

It is worth noting I have seen poor memory management on UFS-only systems as well (excessive caching and swapping); the problem in that scenario just tends to have less impact, as UFS is not fighting with another filesystem but rather with user processes.  The fact that the perl command does immediately make it free up the RAM (only after ZFS first gives up its share, though) suggests the mechanism exists for UFS to let go of the RAM, but someone somewhere made a conscious decision for it to be aggressive.

I think the blame was wrongly pointed at ZFS for this issue in previous discussions.  When the system was ZFS-only, I never saw it swap until ZFS had already let go of all its cache first.


----------



## Crivens (Feb 24, 2011)

The problem is the usage of the vnode cache in the FreeBSD kernel. This is not a problem in itself; the problem is that ZFS does not use it. The kernel keeps memory pages around as long as possible (Inact), as long as they are associated with a vnode. Unmounting removes the vnode, makes the pages inaccessible, and thus frees the memory. Now, when free memory starts to get low enough, ZFS frees ARC space, only for that space to be immediately claimed by the vnode cache. 
A solution might be to shrink the ARC based not on free memory but on free + Inact. As it is, the ARC gets squeezed too much for good performance while Inact memory lies around waiting to be used.

IMHO, the vnode cache is superior to the ARC because it also caches the assignment of pages to vnodes, and reactivating that memory is faster than first asking the ARC; better to make ZFS use the vnodes. But that is my opinion. The memory management is otherwise very good, and seeing memory go to swap could just mean that it is allocated but has not been accessed in some time, like a web browser hanging around dormant. In that case, the memory is better used to cache some busy files and brought back in again when needed.

HTH


----------



## chrcol (Feb 24, 2011)

Believe me, it's not swapping only idle data; performance grinds to a halt as it swaps many hundreds of MBs of data, if not GBs, whilst preserving the UFS cache.

I have seen it swap over 2 GB of RAM whilst holding 6 GB of data in cache.  I unmounted the UFS drive and suddenly had 6 GB of free RAM.  Problem there?

Interesting point about the vnodes, but when I set max vnodes to a very low number I noticed two things:

1 - it didn't limit Inact, or specifically the UFS cache size.
2 - ZFS became unstable, which means it has some kind of relation to vnodes.  When I set it very low, the server wouldn't even boot.

Incidentally, there is a sysctl to control swapping out idle processes, which defaults to 0; if the VM subsystem swaps idle stuff anyway to favour preserving filesystem cache, then that sysctl is perhaps misleading.
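The knob meant here is presumably vm.swap_idle_enabled (which also has companion vm.swap_idle_threshold1/2 sysctls):

```
# /etc/sysctl.conf
vm.swap_idle_enabled=0   # the default; in theory, don't swap out idle processes
```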


----------



## Crivens (Feb 25, 2011)

chrcol said:

> Believe me, it's not swapping only idle data; performance grinds to a halt as it swaps many hundreds of MBs of data, if not GBs, whilst preserving the UFS cache.


This should not happen. Is the machine starved of I/O or really starved of memory?
From what I have seen, the ARC gets pushed down to its minimum and then ZFS is slow as snails. 


> I have seen it swap over 2 GB of RAM whilst holding 6 GB of data in cache.  I unmounted the UFS drive and suddenly had 6 GB of free RAM.  Problem there?


Those 2 GB could indeed be older than the file content in the cache, so it could be fully correct to swap them out. By unmounting, you also destroy all references to the cached vnodes, which then drop all the memory they were holding.


> Interesting point about the vnodes, but when I set max vnodes to a very low number I noticed two things:
> 
> 1 - it didn't limit Inact, or specifically the UFS cache size.
> 2 - ZFS became unstable, which means it has some kind of relation to vnodes.  When I set it very low, the server wouldn't even boot.


Limiting the number of vnodes will not limit the amount of memory each one can hold. One vnode could denote a backup file of maybe terabytes, which then of course would try to be cached.


> Incidentally, there is a sysctl to control swapping out idle processes, which defaults to 0; if the VM subsystem swaps idle stuff anyway to favour preserving filesystem cache, then that sysctl is perhaps misleading.


There is a difference between swapping and paging. The sysctl seems to be for real swapping, not paging; please correct me on this if I am wrong.
Paging means that single memory pages are written to disk; swapping means that whole processes are written out, to be swapped in again maybe several seconds later. That happens when paging does not free memory fast enough, IIRC.


----------



## chrcol (Feb 27, 2011)

> This should not happen. Is the machine starved of I/O or really starved of memory?
> From what I have seen, the ARC gets pushed down to its minimum and then ZFS is slow as snails.



Probably both.  I have never seen a server that's swapping many hundreds of MBs of data or more not take a performance hit, often a crippling one, regardless of filesystem; generally, if I see a server with more than a few MBs swapped at any time I consider it a problem.  ZFS of course crawls when the ARC is low, so this further compounds the problem.  In addition, it was reducing the ARC to below the minimum ARC setting.  When this swapping occurred, the ZFS ARC had already been reduced to a tiny amount and free RAM was at or close to 0; however, cache, buffer and Inact remained high.



> Those 2 GB could indeed be older than the file content in the cache, so it could be fully correct to swap them out. By unmounting, you also destroy all references to the cached vnodes, which then drop all the memory they were holding.



This is probably the case, but it needs to be more intelligent than age: if I am reading data only once, I don't want it cached in preference to data that will be read again.  This is a good argument for manual control of caching; it's been provided for ZFS, but UFS remains auto-tuning only.



> There is a difference between swapping and paging. The sysctl seems to be for real swapping, not paging; please correct me on this if I am wrong.
> Paging means that single memory pages are written to disk; swapping means that whole processes are written out, to be swapped in again maybe several seconds later. That happens when paging does not free memory fast enough, IIRC.



I don't know, but I have probably got it wrong, as it doesn't work the way I would expect it to.

Incidentally, it's still behaving itself now, days after I capped the buffer very low and disabled soft updates.


----------

