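The 4.10 kernel includes an interesting perf patch whose main purpose is to improve the detection of cache contention [1], and of false sharing in particular. When false sharing occurs, a piece of code that was expected to speed up through multi-core parallelism often ends up running slower than it would on a single core.
For example, consider the following program: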
struct foo {
    int x;
    int y;
};
static struct foo f;
/* The two following functions are running concurrently: */
int sum_a(void)
{
    int s = 0;
    int i;
    for (i = 0; i < 1000000; ++i)
        s += f.x;
    return s;
}
void inc_b(void)
{
    int i;
    for (i = 0; i < 1000000; ++i)
        ++f.y;
}
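Start two threads so that sum_a() and inc_b() run on different CPUs. At first glance the two functions read and write different addresses, so they should be able to run independently. However, the cache coherence mechanism works at cache line granularity, and f.x and f.y sit in the same cache line. Each time sum_a() reads f.x, its CPU may find that the line holding f.x is dirty in the other CPU's cache (because inc_b() has updated f.y), so it has to spend time re-reading the line, even though the data that gets synced in is something sum_a() never uses. [2]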
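To make the scenario above concrete, here is a minimal sketch of a harness that runs the two loops on two POSIX threads. It is an illustration under a few assumptions, not part of the original example: the fields are marked volatile so the compiler cannot hoist the accesses out of the loops, the helper names run_false_sharing_demo() and final_y() are hypothetical, and the threads are not pinned, so it is up to the scheduler whether they actually land on different CPUs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_ITERS 1000000

/* volatile (an addition for this sketch) forces a real load/store on
 * every iteration, so both fields keep hitting the same cache line. */
struct foo {
    volatile int x;
    volatile int y;
};

static struct foo f;

/* Reader: repeatedly loads f.x and accumulates it. */
static void *sum_a_thread(void *arg)
{
    int *out = arg;
    int s = 0;

    for (int i = 0; i < N_ITERS; ++i)
        s += f.x;
    *out = s;
    return NULL;
}

/* Writer: repeatedly updates f.y, the neighbour of f.x. */
static void *inc_b_thread(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_ITERS; ++i)
        ++f.y;
    return NULL;
}

/* Hypothetical helper: runs both loops concurrently.
 * Returns 0 on success, -1 if a thread could not be created. */
int run_false_sharing_demo(int *sum_out)
{
    pthread_t a, b;

    f.x = 0;
    f.y = 0;
    if (pthread_create(&a, NULL, sum_a_thread, sum_out) != 0)
        return -1;
    if (pthread_create(&b, NULL, inc_b_thread, NULL) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Only inc_b_thread writes f.y, so the count must be exact. */
    assert(f.y == N_ITERS);
    return 0;
}

/* Hypothetical accessor for the final value of f.y. */
int final_y(void)
{
    return f.y;
}
```

The result is correct either way (the two threads never touch the same variable); false sharing only shows up as lost time. Calling run_false_sharing_demo() from a small main() and building with -pthread gives a binary that can be profiled.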
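perf c2c is a tool that Red Hat engineers spent quite a long time developing [2]. It was recently merged into 4.10 [1] and makes this kind of behavior easy to observe. One of the engineers on that team has written an article that explains how to use it step by step [3].
Case study: RCU performance in the kernel was once hurt by false sharing as well, and the fix was simply to make the per-CPU data cache aligned: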
commit 11bbb235c26f93b7c69e441452e44adbf6ed6996
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date: Thu Sep 4 12:21:41 2014 -0700
rcu: Use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data
The rcu_data per-CPU variable has a number of fields that are atomically
manipulated, potentially by any CPU. This situation can result in false
sharing with per-CPU variables that have the misfortune of being allocated
adjacent to rcu_data in memory. This commit therefore changes the
DEFINE_PER_CPU() to DEFINE_PER_CPU_SHARED_ALIGNED() in order to avoid
this false sharing.
Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c0673c5..ab6fcfb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -105,7 +105,7 @@ struct rcu_state sname##_state = { \
.name = RCU_STATE_NAME(sname), \
.abbr = sabbr, \
}; \
-DEFINE_PER_CPU(struct rcu_data, sname##_data)
+DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data)
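The effect of cache-aligning per-CPU data can be sketched in plain user-space C11. This is only an analogy, not the kernel's implementation of DEFINE_PER_CPU_SHARED_ALIGNED(): the 64-byte line size is an assumption (typical for x86-64), and CACHE_LINE_SIZE, NR_SLOTS, the two struct types and the stride helpers are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* assumption: typical x86-64 cache line */
#define NR_SLOTS 4          /* hypothetical number of CPUs/threads */

/* Unaligned slot: neighbouring slots share a cache line. */
struct counter_plain {
    long value;
};

/* Aligned slot: alignas pads and aligns the struct so that each
 * array element starts on its own cache line. */
struct counter_aligned {
    alignas(CACHE_LINE_SIZE) long value;
};

static struct counter_plain plain_slots[NR_SLOTS];
static struct counter_aligned aligned_slots[NR_SLOTS];

/* Compile-time checks of the layout this sketch relies on. */
static_assert(sizeof(struct counter_aligned) == CACHE_LINE_SIZE,
              "each aligned slot should fill exactly one cache line");
static_assert(alignof(struct counter_aligned) == CACHE_LINE_SIZE,
              "aligned slots should start on a cache line boundary");

/* Distance in bytes between two neighbouring unaligned slots. */
size_t plain_stride(void)
{
    return (size_t)((char *)&plain_slots[1] - (char *)&plain_slots[0]);
}

/* Distance in bytes between two neighbouring aligned slots. */
size_t aligned_stride(void)
{
    return (size_t)((char *)&aligned_slots[1] - (char *)&aligned_slots[0]);
}
```

With the plain layout, several slots fit in one line, so a write to one slot can invalidate the line under its neighbours. With the aligned layout each slot owns a full line, at the cost of extra memory per slot.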
For more detailed literature on false sharing, see [5].