Performance Analysis with Perf

In this issue, we will be looking at the Linux perf tool, which can be used in a variety of ways to perform performance analysis and instrumentation of software.

When collecting data via tracepoints, perf works by instrumenting running code (user- or kernel-space), causing it to emit events at predefined tracepoints in the code. In userspace programs, these tracepoints are defined by application developers and are referred to as Userspace Statically Defined Tracepoints (USDTs). In the Linux kernel, these tracepoints are defined by the kernel developers throughout the various kernel subsystems. Each tracepoint can emit arbitrary data from the executing program’s state (e.g. the process ID); these data are copied into a memory-mapped file used as a buffer between the instrumented code and the perf program.

Perf can also profile executing programs by simply sampling the instruction pointer (IP) to determine where the program is spending time when on the processor. Sampling occurs at a predefined interval, to keep overhead relatively low.

The operations mentioned above are performed using the record subcommand, yielding a binary file named (by default) perf.data. Other subcommands (script, report) can be used to view the information stored in the binary file.
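As a minimal sketch of this workflow (using a hypothetical target process ID, and guarded so that it is a no-op on machines without perf installed):

```shell
# Hypothetical end-to-end workflow: sample a process, then inspect perf.data.
PID=1234   # hypothetical target process ID

if command -v perf >/dev/null 2>&1; then
    # Sample the instruction pointer ~99 times a second, capturing call
    # stacks (-g); writes perf.data in the current directory.
    perf record -F 99 -g -p "$PID" -- sleep 10 || true

    # Aggregated, sorted summary of where time was spent.
    perf report --stdio || true

    # Raw per-sample view (one stack trace per event).
    perf script | head -n 20 || true
fi
```

The odd sampling frequency (99 Hz rather than 100 Hz) is a common choice to avoid sampling in lockstep with periodic activity in the profiled program.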

Execution Profiling

Using perf to profile code is relatively straightforward. Here we are profiling a simple Java example that launches several threads to update a counter using either a lock or an atomic update.
First, we can record the instructions event, capturing stack traces, and specifying the process ID of our program:

perf record -e instructions -g -p $PID

Running perf script to view the traces shows us a stack trace containing kernel symbols, ‘unknown’ symbols, and some symbols from the JVM:

#     thread    pid     timestamp                 event
atomic-inc-1 450679 207951.673422:          1 instructions:
# .. kernel traces
            7f1ae4378bdb [unknown] (/tmp/
            7f1adc8a2a26 [unknown] (/tmp/
            7f1adc8a2c38 [unknown] (/tmp/
            7f1adc8a2bf3 [unknown] (/tmp/
            7f1adc8a2c38 [unknown] (/tmp/
            7f1adc899849 [unknown] (/tmp/
            7f1afa7df5d9 JavaCalls::call_helper+0x349 \

The first problem we encounter is that the code symbol names are not resolved. When using a JIT-compiled runtime such as the JVM or dotnet core, we need to provide a way for perf to map the recorded addresses to symbol names.

In the kernel, this is done using the /proc/kallsyms file, which contains a mapping between the addresses collected by perf and the names of the associated functions:

$ grep " ext4_write_begin" /proc/kallsyms
ffffffffaf3b2fc0 t ext4_write_begin
# address          symbol name
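Note that unprivileged users may see all-zero addresses in /proc/kallsyms: the kernel hides symbol addresses unless the kernel.kptr_restrict sysctl is relaxed or perf is run as root. A quick check (Linux only, guarded so it is harmless elsewhere):

```shell
# Check whether kernel symbol addresses are visible to the current user.
if [ -r /proc/kallsyms ]; then
    head -n 1 /proc/kallsyms              # an all-zero address means restricted
    sysctl kernel.kptr_restrict 2>/dev/null || true   # 0 = addresses visible
fi
# As root: sysctl -w kernel.kptr_restrict=0 to expose the addresses.
```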

The perf program expects missing symbols to be in a map file at /tmp/perf-<pid>.map (as shown in the truncated paths above). For JVM-based applications, we can use perf-map-agent [1] to generate the required symbols. For dotnet core, we can export the environment variable COMPlus_PerfMapEnabled=1 to generate a perf symbol map while the program is running.
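For example (a sketch; the perf-map-agent script path depends on where the library has been checked out and built):

```shell
# JVM: attach perf-map-agent to a running JVM to write /tmp/perf-<pid>.map
# (path to the helper script depends on your perf-map-agent checkout).
PID=1234                                          # hypothetical JVM process ID
./perf-map-agent/bin/create-java-perf-map.sh "$PID" 2>/dev/null || true

# dotnet core: must be set in the environment before the process starts.
export COMPlus_PerfMapEnabled=1
```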

Once this approach has been taken, we can start to get a more useful record of the program’s execution profile:

atomic-inc-0 452021 208883.057213:     152434 instructions:

    7f343c37b25b Lcom/aitusoftware/workshop/profiling/ \
        examples/locking/ \
        LockingExamples$AtomicIncrementer;::run+0x9b \
    7f34348a2a26 Interpreter+0x1526 (/tmp/

For a visual display of the execution profile, we can use Brendan Gregg’s Flamegraph [2] library to generate an interactive SVG file.
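Assuming a local checkout of the FlameGraph repository, the usual pipeline folds the perf script output into per-stack counts and renders the SVG:

```shell
# Fold the stack traces in perf.data and render an interactive SVG
# (assumes the FlameGraph scripts are checked out in ./FlameGraph).
if command -v perf >/dev/null 2>&1 && [ -d FlameGraph ]; then
    perf script \
        | ./FlameGraph/stackcollapse-perf.pl \
        | ./FlameGraph/flamegraph.pl > flame.svg
fi
```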

This execution profile view can be useful for determining where in our programs there are hotspots. These are the best places to start looking for optimisation opportunities (if required).

Event Tracing

Another use of the perf command is to look at the scheduling behaviour of threads in our programs. For instance, we may find that the operating system is preventing our programs from running in order to perform house-keeping work, such as servicing a network card.

Our programs may also be voluntarily yielding CPU time if locking is used. To gain insight into this behaviour, we can use the tracepoints in the Linux scheduler.
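The available scheduler tracepoints can be enumerated with perf list (a sketch; tracepoint access typically requires root):

```shell
# List the scheduler tracepoints; sched:sched_switch fires on every
# context switch, recording the outgoing and incoming threads.
if command -v perf >/dev/null 2>&1; then
    perf list 'sched:*' 2>/dev/null | head -n 10 || true
fi
```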

We will record the sched:sched_switch events emitted by the scheduler to determine whether our program is actually running on CPU:

perf record -e sched:sched_switch -g -p $PID

Using perf script again to view the traces, we can observe the effect of using locks. Our thread is scheduled off CPU to be replaced by the idle process (swapper) when it tries and fails to acquire a lock:

locked-inc-1 452719 [004] 209497.094677: sched:sched_switch:
 prev_comm=locked-inc-1  # thread coming off CPU
 prev_pid=452719        # thread pid
 prev_prio=125          # thread priority
 prev_state=S ==>       # thread state (S = sleep)
 next_comm=swapper/4    # thread coming on CPU
 next_pid=0             # thread pid
 next_prio=120          # thread priority
  schedule+0x42 ([kernel.kallsyms])
  futex_wait_queue_me+0xbb ([kernel.kallsyms])
  futex_wait+0x105 ([kernel.kallsyms])
  do_futex+0x10f ([kernel.kallsyms])
  __x64_sys_futex+0x13f ([kernel.kallsyms])
  do_syscall_64+0x57 ([kernel.kallsyms])
  entry_SYSCALL_64_after_hwframe+0x44 ([kernel.kallsyms])
  __lll_lock_wait+0x30 (/lib/x86_64-linux-gnu/
  Unsafe_Unpark+0x153 (/opt/jvm/lib/server/
  Lcom/aitusoftware/workshop/profiling/ \
    examples/locking/LockingExamples$Counter; \

The thread that performs atomic increments, however, is also descheduled to make way for other processes. Here we can see that its state is still runnable (R) when it is moved off the CPU:

atomic-inc-0 452720 [002] 209506.204522: sched:sched_switch:
 prev_state=R # thread was runnable
 next_comm=kworker/2:1 # kernel worker thread
  schedule+0x42 ([kernel.kallsyms])
  Lcom/aitusoftware/workshop/profiling/ \
    examples/locking/LockingExamples$AtomicIncrementer; \
    ::run+0xb2 (/tmp/

Open-source tools such as grav [3] can be used to generate a visual guide to program scheduler behaviour.

When tuning latency-sensitive systems, it is important to understand why threads may be descheduled, and to measure how long they are off-CPU. This can help to explain unexpected latency spikes, and system tracing tools such as perf are vital in investigating these kinds of issues.
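One way to quantify this off-CPU time is perf's sched subcommand, which records scheduler events and can summarise per-thread wait and run times (a sketch, assuming a reasonably recent perf, root access, and a hypothetical target process):

```shell
# Record scheduler activity around the target process, then summarise
# how long each thread waited before getting back on CPU.
PID=1234    # hypothetical target process ID
if command -v perf >/dev/null 2>&1; then
    perf sched record -p "$PID" -- sleep 10 || true
    # timehist shows per-switch wait time, scheduling delay and run time.
    perf sched timehist 2>/dev/null | head -n 20 || true
fi
```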

Other solutions

Collecting execution profiles from JVM-based programs has become much easier in the last few years. Async-profiler [4] and JMC [5] both provide a much more integrated solution, and can also collect memory profiling information. We will take a look at these JVM-specific profilers in a future edition of the newsletter.

There are also many other tools for performing kernel tracing, at various levels of maturity. LTTng, SystemTap, ftrace, and eBPF are all worth investigating to get a better idea of the kind of capabilities available for system-level profiling.
