In this issue, we look at the Linux perf tool, which can be used in a variety of ways to analyse and instrument software performance.
When collecting data via tracepoints, perf works by instrumenting running code (user- or kernel-space), causing it to emit events at predefined tracepoints. In userspace programs, these tracepoints are defined by application developers and are referred to as Userspace Statically Defined Tracepoints (USDTs). In the Linux kernel, tracepoints are defined by the kernel developers in the kernel’s various subsystems. Each tracepoint can emit arbitrary data from the executing program’s state (e.g. the process ID); these data are copied into a memory-mapped file that acts as a buffer between the instrumented code and the perf program.
Perf can also profile executing programs by sampling the instruction pointer (IP) to determine where a program spends its time while on the processor. Sampling occurs at a predefined interval to keep overhead relatively low.
The operations mentioned above are performed using the record subcommand, which yields a binary file named (by default) perf.data. Other subcommands (script, report) can be used to view the information stored in this file.
Using perf to profile code is relatively straightforward. Here we are profiling a simple Java example that launches several threads to update a counter using either a lock or an atomic update.
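The workshop code itself is not reproduced here, but its shape can be sketched as follows. Class and method names are illustrative assumptions, chosen only to mirror the kind of symbols that appear in the traces:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the profiled example: several threads bump a shared counter,
// either under a lock or via an atomic update. Names are illustrative.
public final class CounterExample {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static long lockedCount;
    private static final AtomicLong ATOMIC_COUNT = new AtomicLong();

    static void lockedIncrement() {
        LOCK.lock();           // may block: contended threads are parked
        try {
            lockedCount++;
        } finally {
            LOCK.unlock();
        }
    }

    static void atomicIncrement() {
        ATOMIC_COUNT.incrementAndGet(); // lock-free CAS loop, no parking
    }

    static long lockedCount() {
        return lockedCount;
    }

    static long atomicCount() {
        return ATOMIC_COUNT.get();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    lockedIncrement();
                    atomicIncrement();
                }
            }, "inc-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(lockedCount() + " " + atomicCount());
    }
}
```

Both strategies produce the same final count; the difference that perf will expose is in how the threads behave while contending for the shared state.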
First, we can record the instructions event, capturing stack traces, and specifying the process ID of our program:
perf record -e instructions -g -p $PID
Running perf script to view the traces shows us a stack trace containing kernel symbols, ‘unknown’ symbols, and some symbols from the JVM:
# thread       pid     timestamp     event
atomic-inc-1 450679 207951.673422:  1 instructions:
        # .. kernel traces
        7f1ae4378bdb [unknown] (/tmp/perf-450642.map)
        7f1adc8a2a26 [unknown] (/tmp/perf-450642.map)
        7f1adc8a2c38 [unknown] (/tmp/perf-450642.map)
        7f1adc8a2bf3 [unknown] (/tmp/perf-450642.map)
        7f1adc8a2c38 [unknown] (/tmp/perf-450642.map)
        7f1adc899849 [unknown] (/tmp/perf-450642.map)
        7f1afa7df5d9 JavaCalls::call_helper+0x349 (/opt/jvm/lib/server/libjvm.so)
The first problem we encounter is that the code symbol names are not resolved. When using a JIT-compiled runtime such as the JVM or .NET Core, we need to provide a way for perf to map these addresses to symbol names.
In the kernel, this is done using the /proc/kallsyms file, which maps the addresses collected by perf to the names of the associated symbols:
$ grep " ext4_write_begin" /proc/kallsyms
ffffffffaf3b2fc0 t ext4_write_begin    # address, type, symbol name
The perf program expects missing symbols to be listed in a map file (shown as /tmp/perf-$PID.map above). For JVM-based applications, we can use perf-map-agent to generate the required symbol map. For .NET Core, we can export the environment variable COMPlus_PerfMapEnabled=1 to have the runtime write a perf symbol map while the program is running.
With the symbol map in place, we start to get a more useful record of the program’s execution profile:
atomic-inc-0 452021 208883.057213: 152434 instructions:
        7f343c37b25b Lcom/aitusoftware/workshop/profiling/examples/locking/LockingExamples$AtomicIncrementer;::run+0x9b (/tmp/perf-451983.map)
        7f34348a2a26 Interpreter+0x1526 (/tmp/perf-451983.map)
For a visual display of the execution profile, we can use Brendan Gregg’s FlameGraph library to generate an interactive SVG file.
This execution profile view can be useful for identifying hotspots in our programs. These are the best places to start looking for optimisation opportunities (if required).
Another use of the perf command is to look at the scheduling behaviour of threads in our programs. For instance, we may find that the operating system is descheduling our programs in order to perform housekeeping tasks, such as servicing a network card.
Our programs may be voluntarily yielding CPU time if locking is used. To gain insight into this behaviour, we can use the tracepoints in the Linux scheduler.
We will record the sched:sched_switch events emitted by the scheduler to determine whether our program is actually running on-CPU:
perf record -e sched:sched_switch -g -p $PID
Using perf script again to view the traces, we can observe the effect of using locks. Our thread is scheduled off CPU to be replaced by the idle process (swapper) when it tries and fails to acquire a lock:
locked-inc-1 452719 209497.094677: sched:sched_switch:
        prev_comm=locked-inc-1   # thread coming off CPU
        prev_pid=452719          # thread pid
        prev_prio=125            # thread priority
        prev_state=S ==>         # thread state (S = sleep)
        next_comm=swapper/4      # thread coming on CPU
        next_pid=0               # thread pid
        next_prio=120            # thread priority
        schedule+0x42 ([kernel.kallsyms])
        futex_wait_queue_me+0xbb ([kernel.kallsyms])
        futex_wait+0x105 ([kernel.kallsyms])
        do_futex+0x10f ([kernel.kallsyms])
        __x64_sys_futex+0x13f ([kernel.kallsyms])
        do_syscall_64+0x57 ([kernel.kallsyms])
        entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
        __lll_lock_wait+0x30 (/lib/x86_64-linux-gnu/libpthread-2.31.so)
        Unsafe_Unpark+0x153 (/opt/jvm/lib/server/libjvm.so)
        Ljdk/internal/misc/Unsafe;::unpark+0xcf
        Lcom/aitusoftware/workshop/profiling/examples/locking/LockingExamples$Counter;::lockedIncrementAndGet+0x748
The thread that performs atomic increments, however, is also descheduled to make way for other processes. Here we can see that its state is still runnable (R) when it is moved off the CPU:
atomic-inc-0 452720 209506.204522: sched:sched_switch:
        prev_comm=atomic-inc-0
        prev_pid=452720
        prev_prio=125
        prev_state=R             # thread was runnable
        ==>
        next_comm=kworker/2:1    # kernel worker thread
        next_pid=450119
        next_prio=120
        schedule+0x42 ([kernel.kallsyms])
        exit_to_usermode_loop+0x8f
        prepare_exit_to_usermode+0x77
        swapgs_restore_regs_and_return_to_usermode+0x0
        Lcom/aitusoftware/workshop/profiling/examples/locking/LockingExamples$AtomicIncrementer;::run+0xb2 (/tmp/perf-452683.map)
Open-source tools such as grav can be used to generate a visual guide to a program’s scheduler behaviour.
When tuning latency-sensitive systems, it is important to understand why threads may be descheduled, and to measure how long they are off-CPU. This can help to explain unexpected latency spikes, and system tracing tools such as perf are vital in investigating these kinds of issues.
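While perf measures off-CPU time from outside the process, a rough application-level approximation is possible by timing a contended lock acquisition directly. The sketch below uses hypothetical names (it is not part of the workshop code); a thread blocked in ReentrantLock.lock() is parked via the same futex-wait path shown in the sched:sched_switch trace earlier:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Measures how long the calling thread is blocked (off-CPU) while
// waiting to acquire a lock held by another thread for holdMillis.
public final class BlockedTime {
    static long blockedNanos(long holdMillis) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch lockHeld = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            lock.lock();
            lockHeld.countDown();
            try {
                Thread.sleep(holdMillis); // hold the lock, keeping the contender parked
            } catch (InterruptedException ignored) {
            } finally {
                lock.unlock();
            }
        }, "lock-holder");
        holder.start();
        lockHeld.await(); // ensure the lock is held before we contend for it
        long start = System.nanoTime();
        lock.lock();      // parks until the holder releases the lock
        long blocked = System.nanoTime() - start;
        lock.unlock();
        holder.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("blocked for ~%d ms%n",
                TimeUnit.NANOSECONDS.toMillis(blockedNanos(100)));
    }
}
```

This only captures time blocked on locks the application knows about; involuntary descheduling (the prev_state=R case above) is invisible to it, which is why scheduler tracing remains necessary.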
Collecting execution profiles from JVM-based programs has become much easier in the last few years. Async-profiler and JMC both provide a much more integrated solution, and can also collect memory profiling information. We will take a look at these JVM-specific profilers in a future edition of the newsletter.
There are also many other tools for performing kernel tracing, at various levels of maturity. LTTng, SystemTap, ftrace, and eBPF are all worth investigating to get a better idea of the kind of capabilities available for system-level profiling.