# CPU core groups



## YuryG (Mar 29, 2017)

Well, I have a long-standing question. Today we have "Hyper-threading", "semi-cores" and similar technologies that make the logical CPUs of an SMP system not really independent of each other. For example, on my AMD FX(tm)-8300 CPU I have 8 "semi-"cores: each of the 4 pairs shares some execution blocks and cache. The same goes for Hyper-threading from Intel. But I do not see any grouping in `sysctl` (kern.ccpu, say, is quite plain) or `powerd` or anything else. Tasks migrate freely across all CPUs (unless explicitly prevented from doing so).
So, my "semi-"theoretical question: is there any way to make FreeBSD aware that the cores are not fully independent?


----------



## User23 (Mar 30, 2017)

Maybe "kern.sched.topology_spec" is the sysctl variable you are searching for.


```
sysctl kern.sched.topology_spec
```


----------



## YuryG (Mar 31, 2017)

Yes, I've seen it, but are there any reasonable values for that sysctl (I couldn't even work out from it whether it is better to group CPU cores with shared resources together or to keep them apart), and does the FreeBSD kernel use it efficiently enough?


----------



## Terry_Kennedy (Mar 31, 2017)

YuryG said:


> Yes, I've seen it, but are there any reasonable values for that sysctl (I couldn't even work out from it whether it is better to group CPU cores with shared resources together or to keep them apart), and does the FreeBSD kernel use it efficiently enough?


The sched_ule(4) scheduler (the default for quite a few FreeBSD versions now) is topology-aware:


			
manpage said:


> - Thread CPU affinity.
> - CPU topology awareness, including for hyper-threading.


Whether or not your system will benefit from hyper-threading depends on both your workload and the CPU model involved. Different generations / types of CPUs have varying amounts of independence between the virtual CPUs. If your workload doesn't have more simultaneous runnable processes than your system has CPU cores, then more cores (either real or hyper-threaded) won't help. On my systems, I generally disable hyper-threading as they have 8 or more real cores.


----------



## User23 (Mar 31, 2017)

The Bulldozer CPU is "special" with its module blocks. No HTT, but the 2 cores in each module have their own integer "clusters" and share only 1 floating-point unit.
https://de.wikipedia.org/wiki/AMD_FX#/media/File:AMD_Bulldozer_block_diagram_(8_core_CPU).PNG

A 4-thread FPU workload pinned to cores 0, 2, 4 and 6 could perform better than 4 threads spread across all cores, hitting the same module blocks again and again. For a jail you could use cpuset(1) to force the threads onto the cores you like, but only for userland processes.
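
As a rough illustration of the pinning idea, here is a small Python sketch that picks one core per module and builds a matching cpuset(1) command line. It assumes that the two cores of a module are numbered consecutively (0/1, 2/3, ...), which matches the 0,2,4,6 example but is not verified for every board. The helper names and the pid 1234 are made up for the example.

```python
from __future__ import annotations


def one_core_per_module(ncpus: int, cores_per_module: int = 2) -> list[int]:
    """Return the first CPU id of every module.

    Assumption: sibling cores of a module have consecutive ids.
    """
    return list(range(0, ncpus, cores_per_module))


def cpuset_command(cpus: list[int], pid: int) -> str:
    """Build a cpuset(1) command line restricting `pid` to `cpus`."""
    cpu_list = ",".join(str(c) for c in cpus)
    return f"cpuset -l {cpu_list} -p {pid}"


if __name__ == "__main__":
    cpus = one_core_per_module(8)
    # 1234 is a placeholder pid, not taken from any real system.
    print(cpuset_command(cpus, 1234))
```

The script only prints the command, so it can be inspected before anything is actually pinned.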


----------



## YuryG (Mar 31, 2017)

Yes, I know about this peculiarity. There are also the shared L2 and L3 caches, so migrating a task from one module to another causes additional cache misses.
As I see in `top -PIHS`, long-lived threads jump across all cores... So, only "manual" optimization in userland is possible?


----------



## YuryG (Mar 31, 2017)

And this output doesn't make me more optimistic about the effectiveness of core scheduling:

```
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="8" mask="ff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="8" mask="ff">0, 1, 2, 3, 4, 5, 6, 7</cpu>
   </group>
  </children>
 </group>
</groups>
```


----------



## Terry_Kennedy (Mar 31, 2017)

YuryG said:


> And this output doesn't make me more optimistic about the effectiveness of core scheduling:


What does your system report during boot in the "FreeBSD/SMP" lines? My guess is that it thinks you have 8 full cores, since that is what AMD claims for that CPU.

Here's what the sysctl reports on a dual Xeon X5680 system with hyper-threading off:

```
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="12" mask="fff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="6" mask="3f">0, 1, 2, 3, 4, 5</cpu>
   </group>
   <group level="2" cache-level="2">
    <cpu count="6" mask="fc0">6, 7, 8, 9, 10, 11</cpu>
   </group>
  </children>
 </group>
</groups>
```
Pretty simple - 6 cores in each socket. Presumably it avoids migrating threads between sockets to avoid caching performance loss. Look what happens if I turn hyper-threading on in that same system:

```
kern.sched.topology_spec: <groups>
 <group level="1" cache-level="0">
  <cpu count="24" mask="ffffff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</cpu>
  <children>
   <group level="2" cache-level="2">
    <cpu count="12" mask="fff">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</cpu>
    <children>
     <group level="3" cache-level="1">
      <cpu count="2" mask="3">0, 1</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c">2, 3</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="30">4, 5</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c0">6, 7</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="300">8, 9</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c00">10, 11</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
    </children>
   </group>
   <group level="2" cache-level="2">
    <cpu count="12" mask="fff000">12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</cpu>
    <children>
     <group level="3" cache-level="1">
      <cpu count="2" mask="3000">12, 13</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c000">14, 15</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="30000">16, 17</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c0000">18, 19</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="300000">20, 21</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
     <group level="3" cache-level="1">
      <cpu count="2" mask="c00000">22, 23</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
     </group>
    </children>
   </group>
  </children>
 </group>
</groups>
```
It has detected a much more complex topology and recognizes the pairs of hyper-threaded logical CPUs in each core.
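
To pick those pairs out of the XML without reading it by eye, something like the following Python sketch could be used. It parses a `kern.sched.topology_spec` value with the standard `xml.etree.ElementTree` module and lists the groups flagged as SMT. The `SAMPLE` string is a shortened, made-up topology in the same format as the output above, not output from a real machine, and the function names are only for this example. On a live system the input would be the sysctl value without the `kern.sched.topology_spec: ` prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened topology in the same format as the sysctl output.
SAMPLE = """<groups>
 <group level="1" cache-level="0">
  <cpu count="4" mask="f">0, 1, 2, 3</cpu>
  <children>
   <group level="2" cache-level="1">
    <cpu count="2" mask="3">0, 1</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
   <group level="2" cache-level="1">
    <cpu count="2" mask="c">2, 3</cpu>
    <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
   </group>
  </children>
 </group>
</groups>"""


def walk_groups(xml_text):
    """Return one dict per <group>, in document order."""
    root = ET.fromstring(xml_text)
    groups = []
    for group in root.iter("group"):
        cpus = [int(c) for c in group.find("cpu").text.split(",")]
        flags = [f.get("name") for f in group.findall("flags/flag")]
        groups.append({
            "level": int(group.get("level")),
            "cache_level": int(group.get("cache-level")),
            "cpus": cpus,
            "flags": flags,
        })
    return groups


def smt_siblings(xml_text):
    """Return the CPU lists of all groups carrying the SMT flag."""
    return [g["cpus"] for g in walk_groups(xml_text) if "SMT" in g["flags"]]


if __name__ == "__main__":
    for pair in smt_siblings(SAMPLE):
        print(pair)
```

If no group carries the SMT flag, as in the flat 8-CPU output earlier in the thread, `smt_siblings` returns an empty list.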

If your CPU reports it has 8 fully-functional cores and no hyper-threading, then I'd expect to see the output you reported. I don't think this sort of shared-core module thing has been repeated on any other x86-64 CPU family (although I could certainly be mistaken). Some recent ARM CPUs have cores of varying abilities (popular for mobile phones and tablets), but the ARM platform seems fragmented enough that there are separate FreeBSD images for different boards so it is presumably a lot easier to handle special-case CPUs.


----------



## YuryG (Mar 31, 2017)

Sure, it reports just 8 cores... But this modules/cores feature is well known for AMD's Bulldozer...

```
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
 cpu0 (BSP): APIC ID: 16
 cpu1 (AP): APIC ID: 17
 cpu2 (AP): APIC ID: 18
 cpu3 (AP): APIC ID: 19
 cpu4 (AP): APIC ID: 20
 cpu5 (AP): APIC ID: 21
 cpu6 (AP): APIC ID: 22
 cpu7 (AP): APIC ID: 23
```
The question is, can I improve the OS's knowledge of the system topology with some tunables, maybe?


----------



## Terry_Kennedy (Mar 31, 2017)

YuryG said:


> Sure, it reports just 8 cores... But this modules/cores feature is well known for AMD's Bulldozer...


Yup


> The question is, can I improve the OS's knowledge of the system topology with some tunables, maybe?


You can force some topologies by setting kern.smp.topology (see /usr/src/sys/kern/kern_smp.c), but the 8-core Bulldozer layout is not one of them. And you might end up hurting performance, since you probably only want to avoid scheduling operations on both cores of a module if those operations use something the module only has one of, instead of one per core (mainly floating-point instructions). There is probably some way to achieve something like cpuset(1) affinity at the kernel level (the simplistic approach would be to force the available CPU mask to 01010101b, though if you have more than 4 runnable threads that will probably kill performance).
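
For reference, the relationship between a CPU mask such as 01010101b (or the hex `mask` attributes in the topology output) and the CPU ids it covers can be sketched in Python as below. This is plain bit arithmetic for illustration only. It does not touch the kernel, and the function names are invented for the example.

```python
from __future__ import annotations


def mask_to_cpus(mask: int) -> list[int]:
    """List the CPU ids whose bit is set in `mask` (bit 0 is cpu0)."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cpus_to_mask(cpus: list[int]) -> int:
    """Build a mask with one bit set per CPU id."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


if __name__ == "__main__":
    print(mask_to_cpus(0b01010101))       # one core per module
    print(mask_to_cpus(int("fc0", 16)))   # second socket of the Xeon example
    print(hex(cpus_to_mask([0, 2, 4, 6])))
```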

Given that there are a lot of those processors out there and I didn't find any discussion since February 2012, this is apparently a complex issue with the potential for a lot of work with small returns. Even Microsoft just provided a pair of hotfixes for Windows 7, deferring further work to a larger rewrite for Windows 8.


----------

