Vi Notes

“:set makeprg=waf” will cause “:make” to compile code using waf.

SHELL
:! cmd – runs command in shell
:sh – opens subshell, ctrl-d returns to editor
:e file – edit a file

WINDOWS
:vsplit – split vertically; ctrl-w w switches windows
:split – split horizontally; ctrl-w w switches windows

ctrl-w h|l|k|j – moves to window left|right|up|down

BUFFERS
:ls – lists all open buffers
:b # – opens buffer #
:bn or :bp – buffer next, buffer previous
:bd # – delete buffer #
ctrl-o or ctrl-i – jumps back or forward through the jump list (which can cross buffers); ctrl-^ switches to the alternate buffer

Performance Tips Related To Cache

Below is a copy of http://www.jauu.net/data/pdf/beware-of-your-cacheline.pdf

Get cpu info on linux:

 cat /proc/cpuinfo

Seems core 2 duo has a 64 byte cache line size
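
The same number can be asked for from C. This is only a sketch, assuming Linux with glibc, where `_SC_LEVEL1_DCACHE_LINESIZE` is available as a `sysconf` extension. The helper name `cache_line_size` and the 64-byte fallback are my own assumptions for illustration.

```c
#include <unistd.h>

/* Hypothetical helper: L1 data cache line size in bytes.
 * _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension. Where it is missing,
 * or the kernel reports nothing useful, fall back to 64 bytes
 * (an assumption matching the Core 2 Duo figure above). */
long cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return n;
#endif
    return 64;
}
```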

Use prefetch

  • Data at contiguous memory locations and in ascending order
  • Use for loops that process large amounts of data
  • If data in loop is less than a cacheline: unroll loop (by hand to take full control – avoid -funroll-loops)
  • AMD PREFETCH and PREFETCHW
  • K6-2:
      • Small performance improvement – shares the FSB, low priority
      • Don’t add overhead through useless prefetching
  • K6-3:
      • Separate backside bus
      • Up to 20% improvement
  • 64 bytes per prefetch
  • AMD tip: prefetch 192 (3*64) bytes ahead, i.e. three cachelines (prefetch distance)
  • BTW: FSB and AMD is a separate topic (keyword HyperTransport, Memory controller, Direct Connect Architecture )

Usage (AMD):

double a[A_REALLY_LARGE_NUMBER];
double b[A_REALLY_LARGE_NUMBER];
double c[A_REALLY_LARGE_NUMBER];
int i;
// prefetch()/prefetchw() stand for the AMD PREFETCH/PREFETCHW instructions; they take an address
for (i=0; i<A_REALLY_LARGE_NUMBER/4; i++) {
  prefetchw(&a[i*4+64]); // will be modifying a
  prefetch(&b[i*4+64]);
  prefetch(&c[i*4+64]);
  a[i*4] =   b[i*4]  *   c[i*4];
  a[i*4+1] = b[i*4+1]*   c[i*4+1];
  a[i*4+2] = b[i*4+2]*   c[i*4+2];
  a[i*4+3] = b[i*4+3]*   c[i*4+3];
}
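
For something that compiles as-is, here is a sketch of the same loop using `__builtin_prefetch`, which GCC and Clang provide. The function name `multiply_prefetch` and the `PREFETCH_AHEAD` constant are my own; the distance of 64 doubles simply mirrors the loop above and is not a tuned value.

```c
#include <stddef.h>

/* Assumption: prefetch 64 doubles ahead, the same distance as the loop above. */
#define PREFETCH_AHEAD 64

/* Computes a[i] = b[i] * c[i], unrolled by four, with prefetch hints.
 * __builtin_prefetch(addr, rw, locality): rw = 1 hints a write, rw = 0 a read. */
void multiply_prefetch(double *a, const double *b, const double *c, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Only form addresses that stay inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 1, 3); /* will be modifying a */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
            __builtin_prefetch(&c[i + PREFETCH_AHEAD], 0, 3);
        }
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

Prefetching is only a hint, so the result is the same with or without it. Whether it is faster has to be measured on the target CPU.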

memcpy
> movdqa/movdqu (aligned/unaligned)
• SIMD extension (SSE2)
• Moves a double quadword from/to an xmm register to/from an xmm register/memory location
• movdqa: 16 byte aligned or general-protection exception
• Advantages: faster than ordinary memcpy for bigger sizes
> p4: prefetchnta, prefetcht0, prefetcht1, prefetcht2
• p3: 32 bytes per prefetch
• p4: 128 bytes per prefetch
> Like madvise(2) it’s a hint and not a command
> prefetchnta
• Prefetches data into L2 cache without polluting L1 caches (non temporal)
• SMP: reduce the CPU – cache traffic
> BTW: icc replaces memcpy() in string.h with a processor specific version
(handles alignment, tlb priming and prefetching)

MMX, SSE, SSE2, SSE4
> SIMD: single instruction, multiple data
> More than one data element processed per instruction
> MMX:
• Shares its registers with the FP unit -> restore register state (emms) before switching back to FP code
• mm[0-7] are aliases for r[0-7] (FPU registers)
• No over- or underflow exceptions (no carry, overflow or adjust flags), like all SIMD instructions
• Wrap-around or saturation mode!
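
To make the two modes concrete, here is a scalar sketch in plain C of what one 8-bit lane does in each mode. The function names are made up for illustration; the MMX instructions they imitate are `paddb` (wrap-around) and `paddusb` (unsigned saturation).

```c
#include <stdint.h>

/* Wrap-around: the sum is taken modulo 256, as paddb does per byte lane. */
uint8_t add_wrap_u8(uint8_t x, uint8_t y)
{
    return (uint8_t)(x + y);
}

/* Unsigned saturation: the sum clamps at 255, as paddusb does per byte lane. */
uint8_t add_sat_u8(uint8_t x, uint8_t y)
{
    unsigned sum = (unsigned)x + (unsigned)y;
    return sum > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)sum;
}
```

With 200 + 100, the wrap-around version gives 44 while the saturating version gives 255. That is why saturation is the usual choice for pixel and audio data.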

> SSE2:
• Mainly graphics and codec processing
• Added 144 instructions

> SSE3:
• Introduction: Pentium 4
• 13 additional instructions
• Thread synchronisation (MONITOR, MWAIT)

> SSSE3:
• Xeon 5100 and Core 2 Duo
• 32 new instructions (16 on 64-bit MMX registers, 16 on 128-bit XMM registers)

> Intel’s Core-Architecture extension: SSE4 (Nehalem New Instructions)
• 50 new OpCodes
• Major improvements:
  • String and text processing
  • Vectorizing
  • Hardware CRC32 calculations (crc32)
  • New advanced string instructions
• Release: 2008

GCC Attributes
> aligned
• Specifies a minimum alignment
• Unaligned exception

 int x __attribute__ ((aligned (16))) = 0; (align on a 16byte boundary)
 short array[3] __attribute__ ((aligned)); (unspecific alignment - largest alignment)

> packed
• Smallest possible alignment

 struct foo {
  char a;
  int x[2] __attribute__ ((packed));
 };
 struct __attribute__ ((packed)) bar { ... };

• Beware of your architecture: e.g. performance killer on powerpc
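
A small sketch of what the two attributes do to layout, assuming GCC or Clang. The struct names are invented for this example. The exact size of the unpacked struct depends on the ABI, so only the packed layout is spelled out in the comments.

```c
#include <stddef.h>

/* Default layout: the compiler may insert padding after 'a' so that x is aligned. */
struct padded {
    char a;
    int  x[2];
};

/* Packed layout: no padding, so x starts at offset 1 and may be misaligned. */
struct __attribute__ ((packed)) squeezed {
    char a;
    int  x[2];
};

/* Raised alignment: every object of this type starts on a 16-byte boundary,
 * which is what an aligned load such as movdqa requires. */
struct aligned16 {
    char buf[16];
} __attribute__ ((aligned (16)));

/* Bytes saved by packing; the value depends on the target ABI. */
size_t packed_saving(void)
{
    return sizeof(struct padded) - sizeof(struct squeezed);
}
```

The saving comes at the price of misaligned accesses to `x`, which is the performance warning above.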

>#define likely(x) __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)
> Incidentally: this code isn’t portable anymore! ;-)
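
A sketch of how these macros are typically used, assuming GCC or Clang for `__builtin_expect`. The function `sum_checked` is a made-up example: the NULL check is marked as the rare case so the compiler can keep the common path straight.

```c
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sums n ints, returning -1 for the rare NULL input. */
long sum_checked(const int *v, size_t n)
{
    long total = 0;
    size_t i;

    if (unlikely(v == NULL))
        return -1;

    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

The hint changes only code layout and branch prediction hints, never the result.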

Tools
> gcc -S
• Play with gcc flags and take a look at the generated results!
> rdtscll
> cachegrind – a cache profiler
• valgrind --tool=cachegrind command
• cg_annotate
> pahole (Arnaldo Carvalho de Melo)
• Utilizes DWARF2 information
> pfunct – prints function details
> git://git.kernel.org/pub/scm/linux/kernel/git/acme/pahole.git
> vtune (Intel, Single Developer: $699)

Additional Information
> Some more tips:
• Avoid branching – use branchless instructions
• sched_setaffinity(2)
• Avoid SMP thrashing
• CPUID opcode (CPU IDentification)
• gcc
  • X86 Built-in Functions
    (v2df __builtin_ia32_addsubpd (v2df, v2df))
  • -march=
  • -msse or -msse2
  • -mfpmath= (387, sse)
• Use strlen() – avoid while(*p++) ++i;
• PowerPC 4xx supports the dlmzb instruction
  • dlmzb: determine left-most zero byte
  • strcpy() is another candidate – dlmzb with support of lswx and stswx
> Intel Smart Memory Access
> Intel Advanced Smart Cache
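
As an illustration of the "avoid branching" tip, here is one common branchless idiom, written as a sketch. The name `max_branchless` is my own, and a modern compiler may well turn an ordinary `a < b ? b : a` into a conditional move by itself, so check the output of gcc -S before hand-optimizing.

```c
#include <stdint.h>

/* Branchless max: mask is all ones when a < b, all zeros otherwise,
 * so the expression selects b or a without a conditional jump. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a < b);
    return (a & ~mask) | (b & mask);
}
```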

Looking At Registers and Memory and Little-Endian Rears Its Head.

Still using a version of hello world:

#include <stdio.h>

int main(void)
{
        printf("Hello World\n");
        return 0;
}

Set a breakpoint at main, run, and when it breaks type “info registers”.

$ gdb -q hello
Reading symbols for shared libraries .. done
(gdb) break main
Breakpoint 1 at 0x100000f10: file hello.c, line 5.
(gdb) run
Starting program: /Users/kristofe/Documents/Projects/hacking/eraseme/hello
Reading symbols for shared libraries +. done

Breakpoint 1, main () at hello.c:5
5		printf("Hello World\n");
(gdb) info registers
rax            0x100000ed0	4294971088
rbx            0x0	0
rcx            0x7fff5fbff7b8	140734799804344
rdx            0x7fff5fbff708	140734799804168
rsi            0x7fff5fbff6f8	140734799804152
rdi            0x1	1
rbp            0x7fff5fbff6d0	0x7fff5fbff6d0
rsp            0x7fff5fbff6d0	0x7fff5fbff6d0
r8             0x31	49
r9             0x0	0
r10            0x1200	4608
r11            0x206	518
r12            0x0	0
r13            0x0	0
r14            0x0	0
r15            0x0	0
rip            0x100000f10	0x100000f10 
eflags         0x206	518
cs             0x27	39
ss             0x0	0
ds             0x0	0
es             0x0	0
fs             0x0	0
gs             0x0	0
(gdb)

Since I am on 64-bit OS X, people on 32-bit Linux might be a little thrown off. The registers are the same ones, just widened to 64 bits and renamed: eip->rip, eax->rax, etc.

You can look at a single register by using the “info register” command. You can use the shorthand of “i r register” as well.

(gdb) info register rip
rip            0x100000f10	0x100000f10 
(gdb)

The “x” command is short for examine and it is used to examine memory. You can format the memory display by putting a “/” followed by an optional count and a letter that is shorthand for a format.

o Display in octal
x Display in hex
u Display in unsigned base-10 decimal
t Display in binary
s Display a string. It will stop at null char.
i Display as an assembly instruction. E.g. x/5i $rip will show the next five assembly instructions that are going to execute.

If you add another letter after the format you can specify the size of each unit of memory shown:

b a single byte
h a halfword ( 2 bytes )
w a word ( 4 bytes ) – confusing because on x86 a 4-byte value is usually called a DWORD
g a giant ( 8 bytes )

(gdb) i r rip
rip            0x100000f10	0x100000f10 
(gdb) x/x 0x100000f10
0x100000f10 :	0x193d8d48
(gdb) x/u 0x100000f10
0x100000f10 :	423464264

Instead of typing the address held in rip by hand, we can refer to the register's value by putting a “$” before its name. The x command then examines the memory at that address, much like dereferencing a pointer with “*” in C/C++. Notice the values and how they are affected by the format. I will discuss below what is going on.

(gdb) x/2x $rip
0x100000f10 :	0x193d8d48	0xe8000000
(gdb) x/8x $rip
0x100000f10 :	0x193d8d48	0xe8000000	0x0000000e	0x000000b8
0x100000f20 :	0x00c3c900	0x010e25ff	0x25ff0000	0x00000110
(gdb) x/8xb $rip
0x100000f10 :	0x48	0x8d	0x3d	0x19	0x00	0x00	0x00	0xe8
(gdb) x/8xh $rip
0x100000f10 :	0x8d48	0x193d	0x0000	0xe800	0x000e	0x0000	0x00b8	0x0000
(gdb) x/8xw $rip
0x100000f10 :	0x193d8d48	0xe8000000	0x0000000e	0x000000b8
0x100000f20 :	0x00c3c900	0x010e25ff	0x25ff0000	0x00000110
(gdb)

Notice that when we change the unit size the values appear to change. Look especially at the lines starting from the byte-sized view (x/8xb $rip). Each view after that uses units twice as big, and each unit looks like the neighboring values from the view above with their order swapped. That is because 80x86 CPUs are little-endian: the least significant byte of a multi-byte value is stored first, at the lowest address. The byte-at-a-time view shows how memory is actually laid out. When you ask for larger units, GDB assembles each one from its bytes as a little-endian integer, so the first byte in memory ends up as the rightmost two hex digits. For example, the bytes 0x48 0x8d 0x3d 0x19 become the word 0x193d8d48. If I messed this up please post a comment.
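
Here is a sketch in C of what GDB is doing when it prints halfwords and words from those bytes. The helper names `load_le16` and `load_le32` are made up for this post. They build the value with shifts, so they give the same answer on any host, little-endian or not.

```c
#include <stdint.h>

/* Assemble a 16-bit value from two bytes stored in little-endian order:
 * the first byte in memory is the least significant one. */
uint16_t load_le16(const unsigned char *p)
{
    return (uint16_t)((unsigned)p[0] | ((unsigned)p[1] << 8));
}

/* Same idea for a 32-bit value: byte 0 lands in the lowest 8 bits. */
uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Feeding in the first four bytes from the x/8xb output (0x48, 0x8d, 0x3d, 0x19) gives 0x8d48 for the halfword and 0x193d8d48 for the word, matching the x/8xh and x/8xw lines.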

Viewing Disassembly on OSX with GDB

Let's start with a version of hello world:

#include <stdio.h>

int main(void)
{
        printf("Hello World\n");
        return 0;
}

Compile it so it generates debug symbols by using the -g flag (a hello.dSYM directory will be generated). The -o flag specifies the filename of the executable instead of the default a.out.

$ gcc -g -o hello hello.c

Now you can debug it in gdb with the following. The -q flag is the “quiet” option that suppresses the introductory and copyright messages.

$ gdb -q hello
Reading symbols for shared libraries .. done
(gdb)

Because you generated debug information (that .dSYM directory) you can type list:

(gdb) list
1	#include <stdio.h>
2
3	void main(void)
4	{
5		printf("Hello World\n");
6	}
(gdb)

Here is what it outputs from a series of commands:

$ gdb -q test
Reading symbols for shared libraries .. done
(gdb) list
1	#include <stdio.h>
2
3	void main(void)
4	{
5		printf("Hello World\n");
6	}
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000100000f10 :	push   %rbp
0x0000000100000f11 :	mov    %rsp,%rbp
0x0000000100000f14 :	lea    0x13(%rip),%rdi        # 0x100000f2e
0x0000000100000f1b :	callq  0x100000f28
0x0000000100000f20 :	leaveq
0x0000000100000f21 :	retq
End of assembler dump.
(gdb) set disassembly-flavor intel
(gdb) disassemble main
Dump of assembler code for function main:
0x0000000100000f10 :	push   rbp
0x0000000100000f11 :	mov    rbp,rsp
0x0000000100000f14 :	lea    rdi,[rip+0x13]        # 0x100000f2e
0x0000000100000f1b :	call   0x100000f28
0x0000000100000f20 :	leave
0x0000000100000f21 :	ret
End of assembler dump.
(gdb)

Notice that gdb defaults to AT&T style assembly. I prefer Intel syntax, which is why the session above includes this command:

(gdb) set disassembly-flavor intel

You can have gdb always output Intel style assembly by adding that command to ~/.gdbinit. Using >> appends to the file, so any settings already in it are kept.

$ echo "set disassembly-flavor intel" >> ~/.gdbinit