An introduction to bpftrace for Linux

Bpftrace is a new open source tracer for Linux for analyzing production performance problems and troubleshooting software. Its users and contributors include Netflix, Facebook, Red Hat, Shopify, and others, and it was created by Alastair Robertson, a talented UK-based developer who has won various coding competitions.

Linux already has many performance tools, but they are often counter-based and have limited visibility. For example, iostat(1) or a monitoring agent may tell you your average disk latency, but not the distribution of this latency. Distributions can reveal multiple modes or outliers, either of which may be the real cause of your performance problems. Bpftrace is suited for this kind of analysis: decomposing metrics into distributions or per-event logs and creating new metrics for visibility into blind spots.
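
To make the point about averages concrete, here is a minimal Python sketch (not bpftrace code). The latency values are hypothetical and chosen only to show how a single average can hide two modes that a distribution makes obvious.

```python
from collections import Counter
from statistics import mean

# Hypothetical disk latencies in milliseconds: 90 fast I/O and 10 slow I/O.
latencies_ms = [1] * 90 + [10] * 10

average = mean(latencies_ms)          # the single number a counter-based tool reports
distribution = Counter(latencies_ms)  # per-value counts, which is what a histogram shows

print(f"average: {average} ms")
for value, count in sorted(distribution.items()):
    print(f"{value:>3} ms  {count:>3}  {'@' * (count // 2)}")
```

The average here is 1.9 ms, a latency that no I/O in the sample actually had, while the counts show the two modes directly.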

You can use bpftrace via one-liners or scripts, and it ships with many prewritten tools. Here is an example that traces the distribution of read latency for PID 30153 and shows it as a power-of-two histogram:

# bpftrace -e 'kprobe:vfs_read /pid == 30153/ { @start[tid] = nsecs; }
kretprobe:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
Attaching 2 probes...
^C

@ns:
[256, 512)         10900 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[512, 1k)          18291 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1k, 2k)            4998 |@@@@@@@@@@@@@@                                      |
[2k, 4k)              57 |                                                    |
[4k, 8k)             117 |                                                    |
[8k, 16k)             48 |                                                    |
[16k, 32k)           109 |                                                    |
[32k, 64k)             3 |                                                    |


This example instruments one event out of thousands available. If you have some weird performance problem, there’s probably some bpftrace one-liner that can shed light on it. For large environments, this ability can help you save millions. For smaller environments, it can be of more use in helping to eliminate latency outliers.

I previously wrote about bpftrace vs. other tracers, including BCC (BPF Compiler Collection). BCC is great for canned complex tools and agents. Bpftrace is best for short scripts and ad hoc investigations. In this article, I’ll summarize the bpftrace language, variable types, probes, and tools.

Bpftrace uses BPF (Berkeley Packet Filter), an in-kernel execution engine that processes a virtual instruction set. BPF has been extended (aka eBPF) in recent years to provide a safe way to extend kernel functionality. It has also become a hot topic in systems engineering, with at least 24 talks on BPF at the last Linux Plumbers Conference. BPF is in the Linux kernel, and bpftrace is the best way to get started using BPF for observability.

See the bpftrace INSTALL guide for how to install it, and get the latest version; 0.9.2 was just released. For Kubernetes clusters, there is also kubectl-trace for running it.

Syntax

probe[,probe,...] /filter/ { action }

The probe specifies what events to instrument. The filter is optional and can filter down the events based on a boolean expression, and the action is the mini-program that runs.

Here’s hello world:

# bpftrace -e 'BEGIN { printf("Hello eBPF!\n"); }'

The probe is BEGIN, a special probe that runs at the beginning of the program (like awk). There’s no filter. The action is a printf() statement.

Now a real example:

# bpftrace -e 'kretprobe:sys_read /pid == 181/ { @bytes = hist(retval); }'

This uses a kretprobe to instrument the return of the sys_read() kernel function. If the PID is 181, a special map variable @bytes is populated with a log2 histogram function with the return value retval of sys_read(). This produces a histogram of the returned read size for PID 181. Is your app doing lots of one byte reads? Maybe that can be optimized.
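
As a rough illustration of what that one-liner computes, here is a Python sketch (a model, not bpftrace itself). The events list and the log2_bucket helper are hypothetical, and the sketch assumes non-negative return values only.

```python
from collections import Counter

def log2_bucket(value: int) -> int:
    """Return the lower bound of the power-of-two bucket that holds value (value >= 0)."""
    return 0 if value == 0 else 1 << (value.bit_length() - 1)

# Hypothetical (pid, retval) pairs, as if captured each time sys_read() returned.
events = [(181, 1), (181, 1), (181, 1), (42, 4096), (181, 700), (181, 4096)]

bytes_hist = Counter()
for pid, retval in events:
    if pid == 181:                            # models the /pid == 181/ filter
        bytes_hist[log2_bucket(retval)] += 1  # models @bytes = hist(retval)

for low, count in sorted(bytes_hist.items()):
    print(f"[{low}, ...)  {count}")
```

In this made-up sample, three of the five reads by PID 181 land in the one-byte bucket, which is the kind of pattern the histogram makes easy to spot.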

Probe types

These are libraries of related probes. The currently supported types are (more will be added):

Type         Description
tracepoint   Kernel static instrumentation points
usdt         User-level statically defined tracing
kprobe       Kernel dynamic function instrumentation
kretprobe    Kernel dynamic function return instrumentation
uprobe       User-level dynamic function instrumentation
uretprobe    User-level dynamic function return instrumentation
software     Kernel software-based events
hardware     Hardware counter-based instrumentation
watchpoint   Memory watchpoint events (in development)
profile      Timed sampling across all CPUs
interval     Timed reporting (from one CPU)
BEGIN        Start of bpftrace
END          End of bpftrace

Dynamic instrumentation (aka dynamic tracing) is the superpower that lets you trace any software function in a running binary without restarting it. This lets you get to the bottom of just about any problem. However, the functions it exposes are not considered a stable API, as they can change from one software version to another. Static instrumentation addresses this: event points are hard-coded and become a stable API. When you write bpftrace programs, try to use the static types first, before the dynamic ones, so your programs are more stable.

Variable types

Variable     Description
@name        global
@name[key]   hash
@name[tid]   thread-local
$name        scratch

Variables with an @ prefix use BPF maps, which can behave like associative arrays. They can be populated in one of two ways:

  • Variable assignment: @name = x;
  • Function assignment: @name = hist(x);

Various map-populating functions are built in to provide quick ways to summarize data.
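
One way to picture the three map-backed variable kinds is with ordinary Python containers. This is only a mental model with hypothetical event data; real BPF maps live in the kernel and behave differently in many details.

```python
from collections import Counter

# Hypothetical (tid, comm) pairs, one per traced event.
events = [(101, "bash"), (102, "sshd"), (101, "bash"), (103, "bash")]

total = 0            # like @name: a single global value
by_comm = Counter()  # like @name[key]: one value per key
last_seen = {}       # like @name[tid]: per-thread state, keyed by thread ID

for index, (tid, comm) in enumerate(events):
    total += 1
    by_comm[comm] += 1
    last_seen[tid] = index  # each thread reads and writes only its own slot

print(total, dict(by_comm), last_seen)
```

Keying on tid is what lets one probe stash a value (such as a start timestamp) and a later probe on the same thread find it again.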

Built-in variables and functions

Here are some of the built-in variables and functions, but there are many more.

Built-in variables:

Variable     Description
pid          Process ID
comm         Process or command name
nsecs        Current time in nanoseconds
kstack       Kernel stack trace
ustack       User-level stack trace
arg0…argN    Function arguments
args         Tracepoint arguments
retval       Function return value
probe        Full probe name

Built-in functions:

Function                       Description
printf("…")                    Print formatted string
time("…")                      Print formatted time
system("…")                    Run shell command
@ = count()                    Count events
@ = hist(x)                    Power-of-2 histogram for x
@ = lhist(x, min, max, step)   Linear histogram for x

See the reference guide for details.
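
To show how a linear histogram differs from the power-of-2 kind, here is a simplified Python model of the bucketing. The lhist_bucket helper and the sample values are hypothetical, and the handling and labeling of out-of-range values is an assumption of this sketch; check the reference guide for what bpftrace actually prints.

```python
from collections import Counter

def lhist_bucket(value, minimum, maximum, step):
    """Return a label for the fixed-width bucket that holds value."""
    if value < minimum:
        return f"(..., {minimum})"   # underflow bucket
    if value >= maximum:
        return f"[{maximum}, ...)"   # overflow bucket
    low = minimum + ((value - minimum) // step) * step
    return f"[{low}, {min(low + step, maximum)})"

# Hypothetical read sizes, bucketed the way lhist(x, 0, 2000, 200) is described.
read_sizes = [-1, 0, 150, 200, 399, 1999, 2000, 65536]
lhist = Counter(lhist_bucket(size, 0, 2000, 200) for size in read_sizes)

for label, count in lhist.items():
    print(f"{label:<16} {count}")
```

Every in-range bucket is the same width (200 here), so linear histograms suit values with a known, narrow range, while power-of-2 histograms suit values that span orders of magnitude.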

One-liners tutorial

A great way to learn bpftrace is via one-liners, which I turned into a one-liners tutorial that covers the following:

  • Listing probes: bpftrace -l 'tracepoint:syscalls:sys_enter_*'
  • Hello world: bpftrace -e 'BEGIN { printf("hello world\n") }'
  • File opens: bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)) }'
  • Syscall counts by process: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count() }'
  • Distribution of read() bytes: bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 18644/ { @bytes = hist(args->retval) }'
  • Kernel dynamic tracing of read() bytes: bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200) }'
  • Timing read()s: bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]) }'
  • Count process-level events: bpftrace -e 'tracepoint:sched:sched* { @[probe] = count() } interval:s:5 { exit() }'
  • Profile on-CPU kernel stacks: bpftrace -e 'profile:hz:99 { @[kstack] = count() }'
  • Scheduler tracing: bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count() }'
  • Block I/O tracing: bpftrace -e 'tracepoint:block:block_rq_issue { @ = hist(args->bytes); }'
  • Kernel struct tracing (a script, not a one-liner): bpftrace path.bt, where the path.bt file is:

#include <linux/path.h>
#include <linux/dcache.h>

kprobe:vfs_open { printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name)); }

See the tutorial for an explanation of each.

Apart from one-liners, bpftrace programs can be multi-line scripts. Bpftrace ships with 28 of them as tools, which can be found in the /tools directory:

tools# ls *.bt
bashreadline.bt  dcsnoop.bt         oomkill.bt    syncsnoop.bt   vfscount.bt
biolatency.bt    execsnoop.bt       opensnoop.bt  syscount.bt    vfsstat.bt
biosnoop.bt      gethostlatency.bt  pidpersec.bt  tcpaccept.bt   writeback.bt
bitesize.bt      killsnoop.bt       runqlat.bt    tcpconnect.bt  xfsdist.bt
capable.bt       loads.bt           runqlen.bt    tcpdrop.bt
cpuwalk.bt       mdflush.bt         statsnoop.bt  tcpretrans.bt


Apart from their use in diagnosing performance issues and general troubleshooting, they also provide another way to learn bpftrace. Here are some examples.

Source

Here’s the code to biolatency.bt:

tools# cat biolatency.bt
/*
 * biolatency.bt    Block I/O latency as a histogram.
 *                  For Linux, uses bpftrace, eBPF.
 *
 * This is a bpftrace version of the bcc tool of the same name.
 *
 * Copyright 2018 Netflix, Inc.
 * Licensed under the Apache License, Version 2.0 (the "License")
 *
 * 13-Sep-2018  Brendan Gregg   Created this.
 */

BEGIN
{
        printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
        @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/

{
        @usecs = hist((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
}

END
{
        clear(@start);
}


It’s straightforward, easy to read, and short enough to include on a slide. This version uses kernel dynamic tracing to instrument the blk_account_io_start() and blk_account_io_done() functions, and it passes a timestamp between them keyed on arg0 to each. arg0 on kprobe is the first argument to that function, which is the struct request *, and its memory address is used as a unique identifier.
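
The start/done pairing can be modeled in a few lines of Python. This is a sketch of the logic only, with a hypothetical trace and made-up request addresses; it is not how the tool is implemented inside the kernel.

```python
from collections import Counter

def log2_bucket(value: int) -> int:
    """Return the lower bound of the power-of-two bucket that holds value (value >= 0)."""
    return 0 if value == 0 else 1 << (value.bit_length() - 1)

# Hypothetical trace: (event, request address, timestamp in nanoseconds).
trace = [
    ("start", 0xffff8800, 1_000_000),
    ("start", 0xffff8900, 1_200_000),
    ("done",  0xffff8800, 2_500_000),   # 1,500,000 ns later = 1500 us
    ("done",  0xffff8a00, 2_600_000),   # start never seen, so it is skipped
    ("done",  0xffff8900, 10_200_000),  # 9,000,000 ns later = 9000 us
]

start = {}         # models @start[arg0]
usecs = Counter()  # models @usecs = hist(...)
for event, request, now in trace:
    if event == "start":
        start[request] = now
    elif request in start:  # models the /@start[arg0]/ filter
        usecs[log2_bucket((now - start[request]) // 1000)] += 1
        del start[request]  # models delete(@start[arg0])

print(dict(usecs))
```

The filter matters when tracing begins mid-flight: a completion whose start was never recorded has no timestamp to subtract, so it is ignored rather than producing a bogus latency.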

Example files

You can see screenshots and explanations of these tools in the GitHub repo as *_example.txt files. For example:

tools# more biolatency_example.txt
Demonstrations of biolatency, the Linux BPF/bpftrace version.

This traces block I/O, and shows latency as a power-of-2 histogram. For example:

# biolatency.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C

@usecs:
[256, 512)             2 |                                                    |
[512, 1K)             10 |@                                                   |
[1K, 2K)             426 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)             230 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[4K, 8K)               9 |@                                                   |
[8K, 16K)            128 |@@@@@@@@@@@@@@@                                     |
[16K, 32K)            68 |@@@@@@@@                                            |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)          10 |@                                                   |

While tracing, this shows that 426 block I/O had a latency of between 1K and 2K
usecs (1024 and 2048 microseconds), which is between 1 and 2 milliseconds.
There are also two modes visible, one between 1 and 2 milliseconds, and another
between 8 and 16 milliseconds: this sounds like cache hits and cache misses.
There were also 10 I/O with latency 128 to 256 ms: outliers. Other tools and
instrumentation, like biosnoop.bt, can shed more light on those outliers.
[…]


Sometimes it can be most effective to switch straight to the example file when trying to understand these tools, since the output may be self-evident (by design!).

Man pages

There are also man pages for every tool in the GitHub repo under /man/man8. They include sections on the output fields and the tool’s expected overhead.

# nroff -man man/man8/biolatency.8
biolatency(8)               System Manager’s Manual              biolatency(8)

NAME
biolatency.bt - Block I/O latency as a histogram. Uses bpftrace/eBPF.

SYNOPSIS
biolatency.bt

DESCRIPTION
This  tool  summarizes  time  (latency) spent in block device I/O (disk
I/O) as a power-of-2 histogram. This  allows  the  distribution  to  be
studied,  including  modes and outliers. There are often two modes, one
for device cache hits and one for cache misses, which can be  shown  by
this tool. Latency outliers will also be shown.
[…]


Writing all these man pages was the least fun part of developing these tools, and some took longer to write than the tool took to develop, but it’s nice to see the final result.

bpftrace vs. BCC

Since eBPF began merging into the kernel, most effort has been placed on the BCC frontend, which provides a BPF library and Python, C++, and Lua interfaces for writing programs. I’ve developed a lot of tools in BCC/Python; it works great, although coding in BCC is verbose. If you’re hacking away at a performance issue, bpftrace is better for your one-off custom queries. If you’re writing a tool with many command-line options or an agent that uses Python libraries, you’ll want to consider using BCC.

On the Netflix performance team, we use both: BCC for developing canned tools that others can easily use and for developing agents; and bpftrace for ad hoc analysis. The network engineering team has been using BCC to develop an agent for its needs. The security team is most interested in bpftrace for quick ad hoc instrumentation for detecting zero-day vulnerabilities. And I expect the developer teams will use both without knowing it, via the self-service GUIs we are building (Vector), and occasionally may SSH into an instance and run a canned tool or ad hoc bpftrace one-liner.

Learn more

I also have a book coming out this year that covers bpftrace: BPF Performance Tools: Linux System and Application Observability, to be published by Addison-Wesley. It contains many new bpftrace tools.


Thanks to Alastair Robertson for creating bpftrace, and the bpftrace, BCC, and BPF communities for all the work over the past five years.
