# Optimizing code for the processor architecture when building the system



## allan_sundry (Feb 7, 2013)

Hi,

I have a server with an Intel Xeon E5-2620 2.00GHz CPU, with FreeBSD 9-STABLE installed on it.


```
CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2000.04-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x206d7  Family = 0x6  Model = 0x2d  Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
```


```
# uname -a
FreeBSD Leon 9.1-STABLE FreeBSD 9.1-STABLE #0 r246404: Wed Feb  6 16:40:12 EET 2013     root@Leon:/usr/obj/usr/src/sys/GENERIC  amd64
```

The system and the ports were rebuilt using Clang:


```
# clang -v
FreeBSD clang version 3.1 (branches/release_31 156863) 20120523
Target: x86_64-unknown-freebsd9.0
Thread model: posix
```


```
# cat /etc/make.conf
CC=clang
CXX=clang++
CPP=clang-cpp
NO_WERROR=
WERROR=
NO_FSCHG=
```

How can I optimize the code for the processor architecture when building the system?

Can I use Clang with the settings in /etc/make.conf "CPUTYPE?=", "CFLAGS=", "CXXFLAGS+=" and "COPTFLAGS="?


----------



## Crivens (Feb 7, 2013)

You can, but be prepared that sometimes 'interesting' things may happen, that the benefit will not be that high, and that no one will help you once you start changing these flags. Setting the CPU type to native is something I did once and removed _very_ soon after.


----------



## allan_sundry (Feb 7, 2013)

Thank you! I read somewhere that LLVM detects the processor architecture automatically. Can I assume that Clang generates optimal code without additional flags?
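On the automatic-detection question: as far as I know, Clang only targets the build host's CPU when it is asked to with `-march=native`; without that flag it generates code for the generic x86-64 baseline. A minimal sketch for checking what `-march=native` resolves to on a given machine follows. The helper name `target_cpu` is made up for illustration, and the sketch assumes the driver prints a `-target-cpu` argument in its `-###` output.

```sh
#!/bin/sh
# target_cpu (hypothetical helper): read a compiler driver's "-###" output on
# stdin and print the argument that follows -target-cpu.
target_cpu() {
    tr -d '"' | tr ' ' '\n' |
        awk 'prev == "-target-cpu" { print; exit } { prev = $0 }'
}

# Only run the real check where clang is installed. -### prints the commands
# the driver would run (on stderr) without executing them.
if command -v clang >/dev/null 2>&1; then
    echo "default: $(clang -### -c -x c /dev/null 2>&1 | target_cpu)"
    echo "native:  $(clang -### -march=native -c -x c /dev/null 2>&1 | target_cpu)"
fi
```

If the two lines differ, the default build is not tuned for the host CPU, and only flags such as `CPUTYPE` would change that.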


----------



## SirDice (Feb 7, 2013)

allan_sundry said:

> Can I use Clang with the settings in /etc/make.conf "CPUTYPE?=", "CFLAGS=", "CXXFLAGS+=" and "COPTFLAGS="?


It's best not to muck about with these unless you absolutely know what you are doing.


----------



## allan_sundry (Feb 8, 2013)

Should I expect a performance improvement if the system is compiled with Clang rather than with GCC 4.2?


----------



## SirDice (Feb 8, 2013)

Nope. If anything it might even be a little slower. Although I think the differences are negligible.


----------



## bes (Feb 8, 2013)

A comparison of the GCC 4.2.1 and LLVM Clang 3.1 compilers:
http://www.phoronix.com/scan.php?page=article&item=freebsd_91_llvmgcc&num=1


----------



## vermaden (Feb 8, 2013)

allan_sundry said:

> How can I optimize the code for the processor architecture when building the system?


If you are asking these questions, then you should definitely stick to GENERIC. Seriously, no offense.


----------



## allan_sundry (Feb 8, 2013)

vermaden said:

> If you are asking these questions, then you should definitely stick to GENERIC. Seriously, no offense.



I'm sorry, but I think you're drawing the wrong conclusion. It seems to me that a question about optimization should get an answer, not a recommendation to do nothing. GENERIC is good, but not always. Thanks anyway!

P.S. If you can say something on the subject of optimization, I would be very interested to hear about your experience (if you have any).


----------



## vermaden (Feb 8, 2013)

@allan_sundry

I am just sharing my thoughts on the topic. I also once wanted to squeeze every CPU tick of performance out of FreeBSD, with a reduced kernel config and recompilations with dozens of gcc flags, etc.

All I found was that this is very time-consuming, error-prone (because of various risky flags) and generally not worth it.

I did all that when CPUs were much slower (a dual Athlon XP system, for example), and even then the gains were very hard to measure. Now, with ultra-fast _Nehalem_/_Sandy Bridge_/_Ivy Bridge_ CPUs, it's even more pointless.

Of course I understand the case in which I would add more features than GENERIC has, but I no longer see the point in trimming it down on fast machines just to gain several ticks.

Various WITHOUT_X=yes settings in /etc/src.conf may be useful for embedded appliances, though.


----------



## allan_sundry (Feb 8, 2013)

@vermaden
Thank you!

As a border router I use a server with two Intel Xeon E5620 2.40GHz CPUs (8 cores without Hyper-Threading) and an Intel X520-DA2 network card, running NanoBSD 8.2-RELEASE-p6 amd64. The border router handles up to 3200 Mbit and 500 kpps of symmetric traffic.
I have a second server that was meant to replace the existing one: two Intel Xeon E5-2620 2.00GHz CPUs (12 cores without Hyper-Threading) and an Intel X520-DA2 network card, running NanoBSD 8.2 amd64. Its performance did not exceed that of the existing server (it was even lower). I tried NanoBSD based on FreeBSD 8.3/8-STABLE/9.1/9-STABLE, but the performance is still bad.
I think that the 12-core system is not optimized, and so its performance is low.
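To put the quoted load in perspective, here is a small arithmetic sketch. It assumes that "3200Mbit" means Mbit/s in decimal units and that both figures describe the same traffic; the helper names are made up for illustration.

```sh
#!/bin/sh
# avg_packet_bytes MBIT KPPS (hypothetical helper): average packet size in
# bytes, assuming decimal units (1 Mbit = 10^6 bits, 1 kpps = 10^3 packets/s).
avg_packet_bytes() {
    echo $(( $1 * 1000 / $2 / 8 ))
}

# ns_per_packet KPPS (hypothetical helper): time budget per packet in
# nanoseconds for the machine as a whole, not per core.
ns_per_packet() {
    echo $(( 1000000 / $1 ))
}

avg_packet_bytes 3200 500   # prints 800
ns_per_packet 500           # prints 2000
```

Under those assumptions the average packet is about 800 bytes and the whole machine has roughly 2 microseconds per packet, before the work is spread over several cores.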


----------



## Crivens (Feb 8, 2013)

You may need to check whether processing power is the bottleneck at all. What is the CPU load on the system?

I would suggest disabling some cores and re-checking the performance. Maybe the sweet spot is at 6 cores total (2x3) or even 2x1, you never know. One reason might be that tasks are switched from one core to another and thus sometimes lose cache locality.
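One way to look at the CPU-load question on FreeBSD is `top -SHP` (per-CPU view that includes kernel and interrupt threads) together with `vmstat -i` for interrupt rates. The sketch below ranks interrupt sources by rate. `irq_rates` is a made-up helper, and it assumes the usual `vmstat -i` layout with a header line, a final `Total` line, and the rate in the last column.

```sh
#!/bin/sh
# irq_rates (hypothetical helper): read `vmstat -i` output on stdin and print
# "rate name" for each interrupt source, busiest first.
irq_rates() {
    awk 'NR > 1 && $1 != "Total" {
            name = $1
            # The name may contain spaces; the last two columns are total, rate.
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            print $NF, name
         }' | sort -rn
}

# Only query the live system on FreeBSD; other systems lack `vmstat -i`.
if [ "$(uname -s)" = "FreeBSD" ]; then
    vmstat -i | irq_rates | head -n 10
fi
```

If a few NIC queue interrupts dominate while most cores sit idle in `top -SHP`, adding cores or compiler flags is unlikely to help.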


----------



## vermaden (Feb 8, 2013)

@*allan_sundry*

From what I remember FreeBSD is 'aware' of how to use HTT, but I would also disable Hyper Threading (logical cores) in the BIOS and try to benchmark it again only with 'true' cores.


----------



## allan_sundry (Feb 8, 2013)

I use only true cores: the 8 and 12 are counted without HT.


----------



## vermaden (Feb 9, 2013)

I would post that to *freebsd-stable*, and also test CURRENT and post the results, with details, to *freebsd-current* on *lists.freebsd.org*.

Not all developers use these forums.


----------



## Crivens (Feb 9, 2013)

Could you try to set hw.ncpu to 8 or lower and check again? If not all cores are busy, the scheduler might push a thread from one core to another, leaving behind the L1/L2 and maybe L3 cache contents, which then need to be refilled on the new core. This may cause a performance loss, but I am not sure how large it would be.


----------



## allan_sundry (Feb 9, 2013)

Crivens said:

> Could you try to set hw.ncpu to 8 or lower and check again? If not all cores are busy, the scheduler might push a thread from one core to another, leaving behind the L1/L2 and maybe L3 cache contents, which then need to be refilled on the new core. This may cause a performance loss, but I am not sure how large it would be.



Unfortunately I cannot carry out the experiment right now because the system is serving customers. Maybe in the near future it will be possible to run the experiment on the new server.


----------



## kenyloveg (Feb 10, 2013)

You may get the best results you ever dreamed of, but it may kill a bunch of time and brain cells that you will not get back ^_^
Just my 2 cents. We always do when we believe, right?


----------



## Crivens (Feb 10, 2013)

allan_sundry said:

> Unfortunately I cannot carry out the experiment right now because the system is serving customers. Maybe in the near future it will be possible to run the experiment on the new server.



I just found out that hw.ncpu seems to be read-only (maybe the loader can tune it), so no on-the-fly change. But I think there was a way to disable cores in a running system; which tunable it was escapes me right now...
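One option that needs no reboot, as far as I know, is cpuset(1), which can bind an interrupt to specific CPUs with `-x`. This is not the tunable mentioned above, only another way to keep the packet work on fewer cores. The sketch below prints the commands without running them. `pin_cmds` is a made-up helper, and the NIC name `ix0`, the CPU count, and the `irqN: ix0:...` naming in `vmstat -i` are assumptions to check on the actual machine.

```sh
#!/bin/sh
# pin_cmds NIC NCPU (hypothetical helper): read `vmstat -i` output on stdin and
# print one cpuset command per interrupt of NIC, spread round-robin over CPUs
# 0..NCPU-1. Nothing is executed; review the output before running it as root.
pin_cmds() {
    awk -v nic="$1" -v ncpu="$2" '
        index($2, nic ":") == 1 {
            irq = $1
            sub(/^irq/, "", irq)   # "irq264:" -> "264:"
            sub(/:$/, "", irq)     # "264:"    -> "264"
            printf "cpuset -l %d -x %s\n", n % ncpu, irq
            n++
        }'
}

# Only query the live system on FreeBSD; ix0 and 6 CPUs are assumptions.
if [ "$(uname -s)" = "FreeBSD" ]; then
    vmstat -i | pin_cmds ix0 6
fi
```

Comparing throughput before and after such a change would show whether cache locality is a factor at all.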


----------



## allan_sundry (Feb 10, 2013)

kenyloveg said:

> You may get the best results you ever dreamed of, but it may kill a bunch of time and brain cells that you will not get back ^_^
> Just my 2 cents. We always do when we believe, right?



Finding solutions to problems is my hobby!

Crivens said:

> I just found out that hw.ncpu seems to be read-only (maybe the loader can tune it), so no on-the-fly change. But I think there was a way to disable cores in a running system; which tunable it was escapes me right now...



Rebooting the system is impossible.


----------



## throAU (Feb 11, 2013)

Install with standard build options and monitor CPU usage.

Unless you're pushing extreme amounts of traffic I doubt CPU will be a bottleneck.

The Cisco ASA I have here at work runs on a single 1.6 GHz Celeron CPU (pre-2007 spec) internally and can still route up to 300 Mbps with stateful packet inspection, or do 170 Mbps of AES throughput.

Going down the path of optimization via compiler flags will mean you may introduce bugs that others will not see, so I'd hesitate to go down that path unless required.

Tweak tunables via sysctls unless you can demonstrate that CPU is an issue and that tunables can't help.


Edit:
If rebooting the system is impossible, how do you expect to recompile the OS?

What is the CPU utilisation on the machine when it is not performing up to par?  Has all the obvious been checked?  (NIC cabling, port speed negotiated / set correctly, etc?)

If all that fails, I'd talk to the mailing list and see if you can get hold of the guy who maintains the driver.


----------



## kpa (Feb 11, 2013)

I seriously doubt that any of the standard uses of FreeBSD are such that compiler optimizations are going to make a noticeable difference. Where they might have an effect is in a very specialized environment where most of the work is calculation on data that is expressed in a form that takes very little space, and where I/O does not play any role.


----------

