Below is a copy of http://www.jauu.net/data/pdf/beware-of-your-cacheline.pdf
Get CPU info on Linux:
cat /proc/cpuinfo
The Core 2 Duo seems to have a 64-byte cache line size
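The line size can also be queried from a program instead of read off /proc/cpuinfo. The following is a minimal sketch, assuming glibc on Linux: `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension to `sysconf(3)`, the helper name `cache_line_size` is made up for this example, and the fallback of 64 is an assumption rather than a measurement.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes.
 * Falls back to 64 (an assumption) when the value is unavailable. */
static long cache_line_size(void)
{
    long size = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 if the kernel reports nothing */
    size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    if (size <= 0)
        size = 64;
    assert(size > 0);
    return size;
}
```

Calling `cache_line_size()` once at startup and sizing buffers or padding from it avoids hard-coding 64 everywhere.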
Use prefetch
- Data at contiguous memory locations and in ascending order
- Use for loops that process large amounts of data
- If data in loop is less than a cacheline: unroll loop (by hand to take full control – avoid -funroll-loops)
- AMD PREFETCH and PREFETCHW
- K6-2:
- Small performance improvement – share FSB, low priority
- Don’t add overhead through useless prefetching
- K6-3:
- Separate backside bus
- Up to 20% improvement
- 64 bytes per prefetch
- AMD tip: prefetch 192 (3*64) bytes ahead, i.e. three cachelines (prefetch distance)
- BTW: FSB on AMD is a separate topic (keywords: HyperTransport, memory controller, Direct Connect Architecture)
Usage (AMD):
double a[A_REALLY_LARGE_NUMBER];
double b[A_REALLY_LARGE_NUMBER];
double c[A_REALLY_LARGE_NUMBER];
int i;
// prefetch()/prefetchw() stand for the PREFETCH/PREFETCHW instructions
for (i = 0; i < A_REALLY_LARGE_NUMBER/4; i++) {
    prefetchw(&a[i*4+64]); // will be modifying a
    prefetch(&b[i*4+64]);  // read only
    prefetch(&c[i*4+64]);  // read only
    a[i*4]   = b[i*4]   * c[i*4];
    a[i*4+1] = b[i*4+1] * c[i*4+1];
    a[i*4+2] = b[i*4+2] * c[i*4+2];
    a[i*4+3] = b[i*4+3] * c[i*4+3];
}
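With GCC the same loop can be written without inline assembly. This is a sketch, assuming GCC or clang: `__builtin_prefetch(addr, rw)` is the compiler builtin (rw = 1 means the line will be written), while the function name `mul_prefetch` and the macro `PREFETCH_DOUBLES` are made up here. The distance of 192 bytes follows the AMD tip above and is not tuned for any particular CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch distance: 3 cache lines of 64 bytes = 192 bytes = 24 doubles. */
#define PREFETCH_DOUBLES (192 / sizeof(double))

/* a[i] = b[i] * c[i], unrolled by hand four times, prefetching ahead. */
static void mul_prefetch(double *a, const double *b, const double *c, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && c != NULL);

    for (; i + 4 <= n; i += 4) {
        /* Only form addresses that lie inside the arrays. */
        if (i + PREFETCH_DOUBLES < n) {
            __builtin_prefetch(&a[i + PREFETCH_DOUBLES], 1); /* will be written */
            __builtin_prefetch(&b[i + PREFETCH_DOUBLES], 0); /* read only */
            __builtin_prefetch(&c[i + PREFETCH_DOUBLES], 0); /* read only */
        }
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

    /* Remainder when n is not a multiple of four. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

Four doubles are only half of a 64-byte line, so this version issues two prefetches per line, as the original loop does. Whether that costs or gains anything is best checked by measurement.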
memcpy
> movdqa/movdqu (aligned/unaligned)
• SIMD extension (SSE2)
• Moves a double quadword from/to an xmm register to/from an xmm register/memory location
• movdqa: 16 byte aligned or general-protection exception
• Advantages: faster than an ordinary memcpy for bigger sizes
> P4: prefetchnta, prefetcht0, prefetcht1, prefetcht2
• P3: 32 bytes
• P4: 128 bytes
> Like madvise(2), it’s a hint and not a command
> prefetchnta
• Prefetches data into L2 cache without polluting L1 caches (non-temporal)
• SMP: reduces the CPU – cache traffic
> BTW: icc replaces memcpy() from string.h with a processor-specific version
(handles alignment, TLB priming and prefetching)
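The movdqa idea can be tried from C through the SSE2 intrinsics. This is a sketch only: `_mm_load_si128` and `_mm_store_si128` are the aligned 128-bit load/store intrinsics from `<emmintrin.h>` (compilers typically emit movdqa or an equivalent aligned move for them), the function name `copy_aligned16` is made up, and builds without SSE2 simply fall back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copies n bytes. Both pointers must be 16-byte aligned and n must be a
 * multiple of 16, otherwise the aligned moves would fault. */
static void copy_aligned16(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert(((uintptr_t)src % 16) == 0);
    assert(n % 16 == 0);

#if defined(__SSE2__)
    __m128i *d = dst;
    const __m128i *s = src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(&d[i], _mm_load_si128(&s[i])); /* aligned 16-byte move */
#else
    memcpy(dst, src, n); /* no SSE2 available: plain copy */
#endif
}
```

A production memcpy also handles unaligned heads and tails, which this sketch deliberately leaves out.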
MMX, SSE, SSE2, SSE4
> SIMD: single instruction, multiple data
> More than one data element processed concurrently by a single instruction
> MMX:
• Mixing with FP unit code -> restore registers first (emms)
• mm[0-7] are aliases for r[0-7] (FPU registers)
• No over- or underflow exceptions (no carry, overflow or adjust flags), as with all SIMD instructions
• Wrap-around or saturation mode!
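The difference between the two modes is easiest to see on a single byte lane. The sketch below models the per-lane behaviour in plain C rather than using MMX itself. The function names are made up, and the mapping to the paddb (wrap-around) and paddusb (unsigned saturation) instructions is given for orientation only.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-around: the result is taken modulo 256 (what paddb does per byte). */
static uint8_t add_wrap_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(x + y);
}

/* Unsigned saturation: the result is clamped at 255 (what paddusb does per byte). */
static uint8_t add_sat_u8(uint8_t x, uint8_t y)
{
    unsigned sum = (unsigned)x + y;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Applies one mode to eight byte lanes, the width of one 64-bit MMX register. */
static void add_lanes(uint8_t out[8], const uint8_t a[8], const uint8_t b[8],
                      int saturate)
{
    assert(out && a && b);
    for (int i = 0; i < 8; i++)
        out[i] = saturate ? add_sat_u8(a[i], b[i]) : add_wrap_u8(a[i], b[i]);
}
```

For pixel data, saturation is usually the wanted mode: 200 + 100 stays at 255 (white) instead of wrapping to 44.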
> SSE2:
• Mainly graphics and codec processing
• Added 144 instructions
> SSE3:
• Introduction: Pentium 4
• 13 additional SIMD instructions
• Thread synchronisation (MONITOR, MWAIT)
> SSSE3:
• Xeon 5100 and Core 2 Duo
• 32 new instructions (16 operations, each for 64-bit MMX and for 128-bit XMM registers)
> Intel’s Core-Architecture extension: SSE4 (Nehalem New Instructions)
• 50 new OpCodes
• Major improvements:
• String and text processing
• Vectorizing
• Hardware CRC32 calculations (crc32)
• New advanced string instructions
• Release: 2008
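The crc32 instruction can be reached from C via `_mm_crc32_u8` in `<nmmintrin.h>`. Note that it computes CRC-32C (Castagnoli polynomial), not the CRC-32 used by zlib. The following is a sketch under a few assumptions: a reasonably recent GCC or clang (for the `target("sse4.2")` attribute and `__builtin_cpu_supports`), made-up function names, and a bitwise software fallback that is simple rather than fast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <nmmintrin.h> /* SSE4.2 intrinsics */
#define CRC32C_X86 1
#endif

/* Portable bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

#ifdef CRC32C_X86
/* Same checksum, one crc32 instruction per input byte. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Uses the hardware instruction when the running CPU has SSE4.2. */
static uint32_t crc32c(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
#ifdef CRC32C_X86
    if (__builtin_cpu_supports("sse4.2"))
        return crc32c_hw(buf, len);
#endif
    return crc32c_sw(buf, len);
}
```

Faster variants feed 4 or 8 bytes per instruction (`_mm_crc32_u32`, `_mm_crc32_u64`); the byte-wise loop is kept here for clarity.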
GCC Attributes
> aligned
• Specifies a minimum alignment
• Unaligned exception
int x __attribute__ ((aligned (16))) = 0; /* align on a 16-byte boundary */
short array[3] __attribute__ ((aligned)); /* no value given: largest alignment used on the target */
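Whether the attribute took effect can be checked at run time by looking at the address. This is a small sketch, assuming GCC or clang; the type `struct vec4` and the helper `is_aligned` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Every object of this type starts on a 16-byte boundary. */
struct vec4 {
    float v[4];
} __attribute__((aligned(16)));

/* Returns 1 if p lies on the given boundary (boundary must be a power of two). */
static int is_aligned(const void *p, uintptr_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)p & (boundary - 1)) == 0;
}
```

A check like `is_aligned(ptr, 16)` is a cheap guard in front of instructions such as movdqa that fault on unaligned operands.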
> packed
• Smallest possible alignment
struct foo {
char a;
int x[2] __attribute__ ((packed));
};
struct __attribute__ ((packed)) bar { ... };
• Beware of your architecture: e.g. performance killer on PowerPC
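The effect of packed on layout can be made visible with sizeof and offsetof. This is a sketch, assuming GCC or clang on a target where a 32-bit integer is normally aligned to more than one byte; the struct names `plain` and `wire` are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: padding is inserted after tag so that value is aligned. */
struct plain {
    char tag;
    uint32_t value;
};

/* Packed layout: no padding, value starts at offset 1 and is misaligned. */
struct __attribute__((packed)) wire {
    char tag;
    uint32_t value;
};

/* Reading through the struct type is safe: the compiler knows the field may
 * be misaligned and emits suitable (possibly slower) loads. */
static uint32_t wire_value(const struct wire *w)
{
    assert(w != NULL);
    return w->value;
}
```

The extra cost of those misaligned accesses is what the architecture warning above refers to. Taking the address of a packed member and dereferencing it as a plain `uint32_t *` loses that protection.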
>#define likely(x) __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)
> Incidentally: this code isn’t portable anymore!
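One way to limit the portability problem is to fall back to a plain expression on compilers without `__builtin_expect`. This is a sketch: the `__GNUC__` check also covers clang, and the function `checked_div` is a made-up example of a rarely taken error path.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
/* Other compilers: keep the code compiling, drop the hint. */
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Divides num by den. Returns 0 on success, -1 if den is zero.
 * Division by zero is marked as the unexpected case. */
static int checked_div(int num, int den, int *out)
{
    assert(out != NULL);
    if (unlikely(den == 0))
        return -1;
    *out = num / den;
    return 0;
}
```

The hint only influences code layout and static branch prediction; the result of the condition is unchanged.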
Tools
> gcc -S
• Play with gcc flags and take a look at the generated results!
> rdtscll
> cachegrind – a cache profiler
• valgrind --tool=cachegrind command
• cg_annotate
> pahole (Arnaldo Carvalho de Melo)
• Uses DWARF2 debug information
> pfunct: prints function details
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git
> vtune (Intel, Single Developer: $699)
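rdtscll from the tools list is a Linux kernel macro; in user space the time stamp counter is available through the `__rdtsc()` intrinsic from `<x86intrin.h>` (GCC and clang). The sketch below wraps it for quick measurements. The names `read_ticks`, `ticks_for` and `spin` are made up, and on non-x86 targets it falls back to `clock_gettime()`, so the unit is then nanoseconds instead of cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Reads a monotonically increasing tick counter. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc(); /* time stamp counter */
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Small workload used as a measurement target. */
static void spin(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000; i++)
        sink += i;
}

/* Returns the ticks elapsed while fn runs once. */
static uint64_t ticks_for(void (*fn)(void))
{
    uint64_t start;

    assert(fn != NULL);
    start = read_ticks();
    fn();
    return read_ticks() - start;
}
```

Single readings are noisy (interrupts, frequency scaling, migration between cores), so it is common to repeat the measurement and take the minimum.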
Additional Information
> Some more tips:
• Avoid branching – use branchless instructions
• sched_setaffinity(2)
• Avoid SMP thrashing
• CPUID opcode (CPU IDentification)
• gcc
• X86 Built-in Functions
(v2df __builtin_ia32_addsubpd (v2df, v2df))
• -march=
• -msse or -msse2
• -mfpmath= (387, sse)
• Use strlen() – avoid while(*p++) ++i;
• PowerPC 4xx supports the dlmzb instruction
• dlmzb: determine left-most zero byte
• strcpy() is another candidate – dlmzb with support of lswx and stswx
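For the "avoid branching" tip in the list above, two common branchless patterns are sketched here. The function names are made up, and modern compilers often generate cmov or setcc for the obvious `if` version as well, so it is worth comparing the gcc -S output of both before rewriting anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branchless maximum: mask is all ones when a < b, all zeros otherwise. */
static int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);
    return a ^ ((a ^ b) & mask); /* a < b selects b, otherwise a */
}

/* Counts elements below limit without a conditional jump in the loop body:
 * the comparison result (0 or 1) is added directly. */
static size_t count_below(const int32_t *v, size_t n, int32_t limit)
{
    size_t count = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] < limit);
    return count;
}
```

The gain shows up mainly on unpredictable data, where a conditional jump would be mispredicted often.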
> Intel Smart Memory Access
> Intel Advanced Smart Cache