# SSHD performance FreeBSD vs GNU/Linux



## IPTRACE (Dec 21, 2016)

Hello!

After some testing I've found that sshd on my FreeBSD server takes more CPU time than on the other OS (11% vs 6%). The usage is the same at 20Mb/s and at 90Mb/s.
It doesn't matter which side starts the SSH session. Please look at the following information.

#1 OS: FreeBSD 11.0-RELEASE-p3
#2 Other side: GNU/Linux 4.7.9-100.fc23.x86_64

#1: OpenSSH_7.2p2, OpenSSL 1.0.2j-freebsd  26 Sep 2016
#2: OpenSSH_7.2p2, OpenSSL 1.0.2j-fips  26 Sep 2016

#1 CPU: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz x2
#2 CPU: Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz x2

#1 HDD: Samsung SSD 950 PRO 512GB
#2 HDD: SCSI drive (I don't know more)

Is there an issue with software optimization or HDD I/O usage?
Thanks for information.

For comparison with sshd, I've tested the upload bandwidth using iperf3.
CPU usage was approximately 11%.
I'm starting to think this could be a network card driver issue?


```
igb0: <Intel(R) PRO/1000 Network Connection, Version - 2.5.3-k> port 0xf020-0xf03f mem 0xfbd20000-0xfbd3ffff,0xfbd44000-0xfbd47fff irq 50 at device 0.0 numa-domain 1 on pci13
igb0: Using MSIX interrupts with 9 vectors
igb0: Ethernet address:
igb0: Bound queue 0 to cpu 0
igb0: Bound queue 1 to cpu 1
igb0: Bound queue 2 to cpu 2
igb0: Bound queue 3 to cpu 3
igb0: Bound queue 4 to cpu 4
igb0: Bound queue 5 to cpu 5
igb0: Bound queue 6 to cpu 6
igb0: Bound queue 7 to cpu 7
igb0: netmap queues/slots: TX 8/1024, RX 8/1024
```


----------



## tingo (Dec 25, 2016)

Are you comparing cpu time across different operating systems? (As in FreeBSD versus Linux)?
You know that cpu time isn't an absolute value, yes?
In addition, there are many, many differences between operating systems that can account for differences. Examples: schedulers, timers, network stack, measurement infrastructure, and more.


----------



## IPTRACE (Dec 27, 2016)

Yes, I'm comparing.
From a business point of view, 100% CPU usage instead of 50% is a big difference: services become unavailable some of the time.
From a technical point of view, I have to provide better hardware, and for the business that again means more money.


----------



## Jeckt (Dec 27, 2016)

Are you sure the same ciphers are negotiated?  
Does your CPU support AESNI? 
If it does, is the kernel module loaded?
If so, is an AES cipher negotiated?
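
A minimal sketch for answering the cipher question from the `ssh -v` debug output. The helper name `negotiated_ciphers` is made up for illustration; it only filters the `kex: ... cipher:` lines that OpenSSH prints at debug level 1.

```shell
# Hypothetical helper: read `ssh -v` debug output on stdin and print
# one "direction cipher" pair for each negotiated direction.
negotiated_ciphers() {
  sed -n -E 's/.*kex: (server->client|client->server) cipher: ([^ ]+).*/\1 \2/p'
}
```

Something like `ssh -v user@host true 2>&1 | negotiated_ciphers` should then print the cipher for each direction (`user@host` is a placeholder).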


----------



## IPTRACE (Dec 27, 2016)

Look at the CPUs in my first post.
Mine is newer.

Below is the negotiation output.

Connection from my server:

```
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256
```

Connection from the other side:

```
debug1: kex: algorithm: curve25519-sha256@libssh.org
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: curve25519-sha256@libssh.org need=64 dh_need=64
debug1: kex: curve25519-sha256@libssh.org need=64 dh_need=64
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ecdsa-sha2-nistp256
```

Anyway, I've tested transferring a 200MB file from my server to my local PC using WinSCP.
The scp transfer took 100% of my CPU from 0% to 38% of the file, then only 2%.
Speed was only 6Mb/s max.

But when I transferred the same file from the other server, CPU usage was 0.3%. That machine has a Core i7-4770 and runs CentOS with OpenSSH v5.3.
Transfer was approx. 30 Mb/s.

I haven't loaded any AES modules yet. I'll check it.


----------



## Jeckt (Dec 27, 2016)

If the null cipher is supported (I believe it must be explicitly enabled), you might be able to determine whether the problem is networking related. Turning TSO on or off for an interface sometimes yields different results. Chacha20 is the fastest (secure) option without hardware acceleration. I'm not sure if FreeBSD will outperform Linux, but you'll likely find that using AESNI drops CPU usage and improves performance. I'm not sure which is faster/better, GCM or CBC, so you'll want to benchmark that.
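
A rough sketch of such a benchmark, assuming the three ciphers listed are enabled on both ends (they are standard OpenSSH cipher names, but `ssh -Q cipher` shows what a given build actually supports). The function name `cipher_bench` and the host argument are placeholders. Data comes from /dev/zero and ends in /dev/null on the remote side, so disks are not involved, and the timing has only one-second resolution.

```shell
# Hypothetical helper: push N MiB of zeros through ssh once per cipher
# and report the elapsed wall-clock time in whole seconds.
cipher_bench() {
  host=$1
  mib=${2:-100}
  for c in chacha20-poly1305@openssh.com aes128-gcm@openssh.com aes128-ctr; do
    start=$(date +%s)
    dd if=/dev/zero bs=1048576 count="$mib" 2>/dev/null |
      ssh -c "$c" "$host" 'cat > /dev/null' || { echo "$c: failed"; continue; }
    end=$(date +%s)
    echo "$c: ${mib} MiB in $((end - start)) s"
  done
}
```

Running `cipher_bench user@host 500` while watching `top` on the server should show whether an AES cipher lowers sshd CPU usage compared with chacha20-poly1305. Repeating the run after `ifconfig igb0 -tso` would cover the TSO suggestion as well.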


----------



## IPTRACE (Dec 27, 2016)

I've set everything up as described in this guide, but openssl does not show any speedup.
https://manuth.life/enable-aes-ni-freebsd/

Should I restart the system after loading the modules and adding them to loader.conf?


----------



## IPTRACE (Dec 29, 2016)

I suppose that AES hardware support is ON by default.
So there is no need to enable the aesni module on my Xeon.

Besides, can someone explain to me why my Xeon is slower than a Core i5?

Core i5-4200U
`$ openssl speed -evp aes-256-gcm`

```
Doing aes-256-gcm for 3s on 16 size blocks: 39678682 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 28685912 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 15976500 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 4400015 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 640688 aes-256-gcm's in 3.00s
OpenSSL 1.0.2j-freebsd  26 Sep 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     211619.64k   611966.12k  1363328.00k  1501871.79k  1749505.37k
```
Xeon E5-2650 v3
`$ openssl speed -evp aes-256-gcm`

```
Doing aes-256-gcm for 3s on 16 size blocks: 36909490 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 64 size blocks: 24096852 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 256 size blocks: 12140545 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 1024 size blocks: 3396772 aes-256-gcm's in 3.00s
Doing aes-256-gcm for 3s on 8192 size blocks: 545254 aes-256-gcm's in 3.00s
OpenSSL 1.0.2j-freebsd  26 Sep 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     196850.61k   514066.18k  1035993.17k  1159431.51k  1488906.92k
```
Without hardware acceleration.
`% openssl speed aes-256-cbc`

```
Doing aes-256 cbc for 3s on 16 size blocks: 9530510 aes-256 cbc's in 3.01s
Doing aes-256 cbc for 3s on 64 size blocks: 2496327 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 634985 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 160440 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 19866 aes-256 cbc's in 3.00s
OpenSSL 1.0.2j-freebsd  26 Sep 2016
built on: date not available
options:bn(64,64) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: clang
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc      50697.36k    53254.98k    54185.39k    54763.52k    54247.42k
```
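
To make the three result tables easier to compare, here is a small sketch that turns the summary row of `openssl speed` output into MB/s for the largest block size. The helper name `speed_mb_per_s` is hypothetical. It relies on the note in the output that the numbers are in 1000s of bytes per second, so dividing by 1000 gives decimal MB/s.

```shell
# Hypothetical helper: read `openssl speed` output on stdin and print the
# throughput of the last column (8192-byte blocks) in MB/s.
speed_mb_per_s() {
  awk '/^aes-/ {
    label = $0
    sub(/ +[0-9].*$/, "", label)   # keep only the cipher name
    v = $NF
    sub(/k$/, "", v)               # drop the trailing "k"
    printf "%s %.1f MB/s\n", label, v / 1000
  }'
}
```

For example, `openssl speed -evp aes-256-gcm 2>&1 | speed_mb_per_s`. Applied to the numbers above, the Xeon comes out at roughly 1489 MB/s with `-evp` and roughly 54 MB/s without it. The two runs use different modes (GCM vs CBC), so this is only a rough comparison.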


----------



## IPTRACE (Dec 29, 2016)

I found this.
http://security.stackexchange.com/q...peed-on-the-same-hardware-with-aes-256-evp-an

It seems that whether the "openssl speed" test uses AES hardware support depends on the options (the -evp option indicates that hardware acceleration should be used).
So my statement that AESNI is enabled by default is not correct?!


----------



## chrbr (Dec 29, 2016)

Please try `kldload aesni` and observe the output on the console. The output shows whether the CPU supports it. If it is not loaded by default, it should be possible to enable it at start-up by adding it to /boot/loader.conf. The line should look like this:

```
aesni_load="YES"
```


----------



## IPTRACE (Dec 29, 2016)

I've done that, but the standard command `openssl speed aes-256-cbc` still shows poor performance. See my post above.
The CPU supports AESNI, and I got the following output.
`dmesg | grep aes`

```
aesni0: <AES-CBC,AES-XTS,AES-GCM,AES-ICM> on motherboard
```
Do you know if the loaded aesni.ko module works immediately, or do I have to restart the OS?


----------



## chrbr (Dec 29, 2016)

IPTRACE said:


> Do you know if the loaded aesni.ko module works immediately, or do I have to restart the OS?


I am not 100% sure because my board does not support this feature. Since the module is loadable, requiring a restart would make no sense. The module should work as soon as it has been loaded. With kldunload(8) you can unload it for benchmarking.


----------



## IPTRACE (Dec 29, 2016)

But when I load the module, the openssl command without hardware support still shows slow performance.
Do you know how I can test AES with the module loaded and with it unloaded?
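
One possible way to run that comparison, offered as a sketch rather than a confirmed answer. It assumes that the userland `openssl` binary uses the AES-NI instructions directly through its EVP code, which would make the result independent of whether aesni.ko is loaded. Under that assumption, the hardware path can be masked with the `OPENSSL_ia32cap` environment variable instead of unloading the module. The function name `aes_speed_pair` is hypothetical, and the mask value is the one commonly quoted for hiding the AES-NI and PCLMULQDQ capability bits.

```shell
# Hypothetical helper: run the same EVP benchmark twice, once with the
# CPU capabilities as detected and once with AES-NI/PCLMULQDQ masked.
aes_speed_pair() {
  cipher=${1:-aes-256-cbc}
  echo "== AES-NI allowed"
  openssl speed -evp "$cipher"
  echo "== AES-NI masked"
  OPENSSL_ia32cap="~0x200000200000000" openssl speed -evp "$cipher"
}
```

If the assumption holds, `aes_speed_pair aes-256-cbc` should show a large gap between the two runs on a CPU with AES-NI, whatever `kldstat` reports for the module.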


----------



## IPTRACE (Dec 29, 2016)

Anyway, I still have the problem with the sshd process.
When I download a file (200 MB) from the server, the first 38% of the file is sent by the server at a maximum speed of 6 Mb/s, then the speed increases to 10 Mb/s and CPU usage is 3%.
The speed is so low because of the 100% CPU usage.
When I send the same file from the client to the server, CPU usage is approx. 3%.

Does sshd process the file somehow, so that a heavy algorithm takes the CPU time?


----------



## chrbr (Dec 30, 2016)

IPTRACE said:


> Anyway, I still have the problem with the sshd process.
> When I download a file (200 MB) from the server, the first 38% of the file is sent by the server at a maximum speed of 6 Mb/s, then the speed increases to 10 Mb/s and CPU usage is 3%.
> The speed is so low because of the 100% CPU usage.
> When I send the same file from the client to the server, CPU usage is approx. 3%.
> ...


I am sure that encryption and decryption take CPU time. The hardware accelerator should reduce the load on the CPU cores by shifting the work to dedicated hardware. These days I have been playing around with encryption using GELI, which is why I chimed in on this thread. During booting, the CPU load is clearly higher than when booting from an unencrypted disk. This is not unexpected.

A benchmark on a server that reaches 100% load might be difficult to interpret. It might make sense to find out where the limitation comes from. Instead of using a file, it might be better to transfer data from /dev/random and let it end up in /dev/null. This should reduce the effect of disk activity on the test results. I am not an expert in networking. I hope other forum members will have additional and better advice.
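
A minimal sketch of that disk-free test for the download direction (server sends, client discards). The function name `ssh_download_test`, the host argument, and the optional source argument are placeholders. Reading from /dev/random also costs some CPU on the server, so /dev/zero may be worth trying as a second data source.

```shell
# Hypothetical helper: have the server generate N MiB and stream it over
# ssh; the client only counts the bytes, nothing is written to disk.
ssh_download_test() {
  host=$1
  mib=${2:-200}
  src=${3:-/dev/random}
  start=$(date +%s)
  bytes=$(ssh "$host" "dd if=$src bs=1048576 count=$mib 2>/dev/null" | wc -c | tr -d ' ')
  end=$(date +%s)
  echo "received $bytes bytes from $host in $((end - start)) s"
}
```

For example, `ssh_download_test user@server 200` from the client, with `top` running on the server. If sshd still hits 100% CPU here, the disk is unlikely to be the limiting factor.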


----------

