> Updates & C-States Silliness on Linux
Posted by prox, from Seattle, on October 11, 2016 at 02:25 local (server) time

I haven't blogged in a while, so here are some updates.

I recently moved dax, which is the FreeBSD-based host that powers this web server, to a VM and away from Internap's Agile service.  dax used to be a dedicated server hosted by Voxel dot net and provided IPv6 connectivity to the rest of my network since September of 2005.  However, things started going downhill when Internap acquired them in 2012.  Technical support turned into a horror story and they recently indicated to me (in an IPv6 support ticket from hell) that the Agile "legacy" services were going to be done away with later in 2016 (not sure if this is true).

I migrated all of my sites off the IPv6 PA /56 I had from Internap and onto my own PI /44.  I moved dax to a VM on excalibur, a dedicated server hosted by Choopa, LLC.  I'm now Internap-free and running everything off of AS395460.

Anyway, excalibur's got an Intel Xeon E3-1240 v2 CPU with 4 cores (8 threads with Hyper-Threading) and should run at 3.40 GHz.  Normally, on servers of mine that run VMs, I use cpufreq-set(1) to put all cores under the performance governor, which is supposed to run every core at its maximum clock rate.  However, with the E3 Xeon CPUs, this doesn't work.  The clock frequency of each core oscillates with load, which causes a considerable latency penalty that is fairly visible as jitter in packet forwarding.  Here's a snapshot:

(excalibur:2:12:EDT)% cat /proc/cpuinfo|grep -i MHz
cpu MHz		: 1714.742
cpu MHz		: 2652.000
cpu MHz		: 3563.492
cpu MHz		: 3799.898
cpu MHz		: 3722.070
cpu MHz		: 2126.195
cpu MHz		: 3800.164
cpu MHz		: 2313.062
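
To put a number on the oscillation instead of eyeballing eight lines, the same /proc/cpuinfo data can be reduced to a minimum and maximum.  This is only a sketch: mhz_spread is a made-up helper name, and the optional file argument exists purely so it can be tried against a saved copy of the output.

```shell
#!/bin/sh
# Hypothetical helper: summarize the "cpu MHz" lines of a cpuinfo-style file.
# $1 = file to read (defaults to /proc/cpuinfo)
mhz_spread() {
    awk -F: '/^cpu MHz/ {
        v = $2 + 0                       # numeric value after the colon
        if (n == 0 || v < min) min = v
        if (n == 0 || v > max) max = v
        n++
    }
    END { if (n) printf "cpus=%d min=%.3f max=%.3f\n", n, min, max }' \
        "${1:-/proc/cpuinfo}"
}
```

Running mhz_spread a few times under different load should show whether the gap between min and max closes after a configuration change.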

Searching the web suggested I would need to completely disable C-states in the BIOS, which I wasn't really willing to do and which didn't feel like the correct solution.  I then came across another post that said I should write "0" to /dev/cpu_dma_latency and keep the file open.  So, I did this:

(excalibur:2:12:EDT)# cat > /dev/cpu_dma_latency
0

.. and didn't hit ^D.  Sure enough:

(excalibur:2:12:EDT)% cat /proc/cpuinfo|grep -i MHz
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
cpu MHz		: 3599.882
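
The request only lasts while the file stays open, so leaving cat sitting in a terminal is fragile.  A rough sketch of the same trick tied to the lifetime of a command is below.  hold_latency and the QOS_DEV override are names invented here for illustration; the override is only there so the function can be exercised against an ordinary file.

```shell
#!/bin/sh
# Device to write to; overridable (assumption) so the sketch can be tested safely.
QOS_DEV="${QOS_DEV:-/dev/cpu_dma_latency}"

# Hypothetical helper: hold a latency request open while a command runs.
# $1 = value to write (the post writes "0"); remaining args = command to run
hold_latency() {
    latency="$1"; shift
    (
        # Open the device on fd 3 inside a subshell; the fd is closed,
        # and the request dropped, when the subshell exits.
        exec 3> "$QOS_DEV" || exit 1
        echo "$latency" >&3
        "$@"
    )
}
```

For example, as root, hold_latency 0 sleep 3600 would keep the request in place for an hour and release it afterwards.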

Seriously?  There's a whole nice structure in /sys/devices/system/cpu/cpufreq/* that Linux has used for over a decade to control CPU frequency scaling and we've now a one-off /dev character device that controls such things on a modern CPU?

Well, considering the stupidity of systemd that's supposedly accepted by most Linux distributions now, I guess I shouldn't be surprised that this type of hacky interface exists.  At least I've got a way to emulate the behavior of the performance governor without mucking with BIOS settings.  It turns out this also works on my E3 1245 v3 Xeon I've at home in vega, too.

Update: This is a bad idea.  This causes all CPUs to run very hot even when utilization is low, which confuses me:

[Image: excalibur burning]

Update: The right solution to this problem is to just disable the Intel P-states driver by passing intel_pstate=disable on the kernel command line.  The acpi-cpufreq driver is used instead and operates the way it should.  See comments for more information.
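
For reference, here is roughly how that looks on a Debian-style system that boots with GRUB.  The edit steps are left as comments since the existing contents of /etc/default/grub vary from box to box; kernel_has_param is a hypothetical helper for confirming the parameter actually made it onto the running kernel's command line.

```shell
#!/bin/sh
# Steps (as root), assuming a Debian-derived system using GRUB:
#   1. In /etc/default/grub, add intel_pstate=disable to the existing
#      GRUB_CMDLINE_LINUX_DEFAULT="..." value.
#   2. Run: update-grub
#   3. Reboot.

# Hypothetical helper: succeed if a parameter appears on the kernel command line.
# $1 = exact parameter, $2 = cmdline file (defaults to /proc/cmdline)
kernel_has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

After the reboot, kernel_has_param intel_pstate=disable && echo "pstate driver disabled" gives a quick confirmation.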

Comment by Tommy on October 11, 2016 at 13:33 local (server) time

Hey Mark, how about this?

echo performance > /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor

where X is the id of the cpu core.

CPU is: Intel(R) Xeon(R) CPU E3-1241 v3
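
The per-core write above can be looped over every core instead of substituting X by hand.  A minimal sketch, assuming the cpuX/cpufreq layout from the command above; set_governor is a made-up name, and the second argument exists only so the loop can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: write one governor name to every core's scaling_governor.
# $1 = governor name, $2 = sysfs CPU root (defaults to /sys/devices/system/cpu)
set_governor() {
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        echo "$1" > "$f"
    done
}
```

As root, set_governor performance does the same thing as the manual echo, once per core.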

Comment by Tommy on October 11, 2016 at 13:35 local (server) time

With "cat > /dev/cpu_dma_latency" I can also reproduce the heating issue, but the manual governor setting seems to work without a hiccup.

Comment by Mark Kamichoff [Website] on October 11, 2016 at 13:35 local (server) time

Unfortunately, that does not work on the E3 Xeons, which is why I started looking around to begin with.  As mentioned above, I run all cores using the "performance" governor, which still results in clock speed oscillation.

Comment by Tommy [Website] on October 11, 2016 at 14:10 local (server) time

Which kernel version are you using? My cpu is very similar, also a Xeon E3 as I've posted before.

root@xxx:~# uname -a
Linux xxx.xxx.se 4.1.27-ovpn-grsec #2 SMP Sat Jul 2 03:02:29 CEST 2016 x86_64 GNU/Linux

What if cpufreq-set(1) misbehaves? Did you try the fully manual method?

Comment by Mark Kamichoff [Website] on October 11, 2016 at 14:28 local (server) time

4.6.0-1-amd64.  cpufreq-set(1) just writes "performance" to scaling_governor, from what I can tell.  Doing it manually makes no difference (I tried).  The hierarchy is a little different between kernel versions, but:

(excalibur:14:26:EDT)% ls -la /sys/devices/**/scaling_governor    
-rw-r--r-- 1 root root 4096 Sep 18 21:34 /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy2/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy3/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy4/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy5/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy6/scaling_governor
-rw-r--r-- 1 root root 4096 Sep 18 21:21 /sys/devices/system/cpu/cpufreq/policy7/scaling_governor
(excalibur:14:26:EDT)% cat /sys/devices/**/scaling_governor  
performance
performance
performance
performance
performance
performance
performance
performance
(excalibur:14:26:EDT)% grep MHz /proc/cpuinfo
cpu MHz : 1600.125
cpu MHz : 3800.031
cpu MHz : 3800.828
cpu MHz : 3094.664
cpu MHz : 3717.554
cpu MHz : 1770.656
cpu MHz : 3783.429
cpu MHz : 1963.367

Comment by Tommy [Website] on October 12, 2016 at 07:14 local (server) time

Hey Mark,
Try to change the scaling driver
echo acpi-cpufreq >/sys/devices/system/cpu/cpufreq/X/scaling_driver

I think intel_pstate is causing the issue.

Comment by Mark Kamichoff [Website] on October 12, 2016 at 12:54 local (server) time

You might be on to something.  I can't change the driver at runtime, though:

(excalibur:12:51:EDT)# pwd                              
/sys/devices/system/cpu/cpufreq/policy0
(excalibur:12:51:EDT)# cat scaling_driver                
intel_pstate
(excalibur:12:52:EDT)# echo acpi_cpufreq > scaling_driver
zsh: permission denied: scaling_driver
(excalibur:12:52:EDT)#

I think I may need to disable it via the kernel command-line at boot with:

intel_pstate=disable

I'll give this a try during the next reboot, thanks!

Comment by Tommy [Website] on October 12, 2016 at 13:58 local (server) time

Alright, if that doesn't work, try appending this to Grub's cmdline:
(/etc/default/grub in case of Debian-derivatives)
GRUB_CMDLINE_LINUX_DEFAULT="cpufreq_driver=acpi-cpufreq"

Comment by Mark Kamichoff [Website] on October 12, 2016 at 14:09 local (server) time

This worked like a champ on one of my boxes.  With intel_pstate disabled, the kernel chooses acpi-cpufreq automatically.

(vega:11:08:PDT)% cat /sys/devices/**/scaling_governor
performance
performance
performance
performance
performance
performance
performance
performance
(vega:11:08:PDT)% grep MHz /proc/cpuinfo
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
cpu MHz : 3401.000
(vega:11:08:PDT)% sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +105.0°C)
temp2:        +29.8°C  (crit = +105.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +62.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +56.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +62.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +58.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +55.0°C  (high = +80.0°C, crit = +100.0°C)

(vega:11:09:PDT)% cpufreq-info|grep driver
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
 driver: acpi-cpufreq
(vega:11:09:PDT)%
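
If cpufreq-info isn't installed, the same check can be read straight out of sysfs.  A small sketch, assuming the policyN layout shown earlier in this thread; scaling_drivers is a hypothetical helper name and the directory argument is only there for trying it outside of /sys.

```shell
#!/bin/sh
# Hypothetical helper: count cpufreq policies per scaling driver.
# $1 = cpufreq sysfs directory (defaults to /sys/devices/system/cpu/cpufreq)
scaling_drivers() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    cat "$root"/policy*/scaling_driver 2>/dev/null \
        | sort | uniq -c \
        | awk '{ print $2 ": " $1 " policies" }'
}
```

On a box like vega above, this should report a single line naming acpi-cpufreq once intel_pstate is disabled.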

Thanks, Tommy!

Comment by Tommy [Website] on October 12, 2016 at 20:51 local (server) time

Glad to hear! ;)

