
How to analyze your system with perf and Python

Modern computer systems are ever increasing in performance and capacity. That matters little if the increasing capacity is not well utilized. Following is a description of the motivation and work behind "curt," a new tool for Linux systems for measuring and breaking down system utilization by process, by task, and by CPU using the perf command's Python scripting capabilities.

I had the privilege of presenting this topic at Texas Linux Fest 2018, and here I've gone a bit deeper into the details, included links to further information, and expanded the scope of my talk.

System utilization

In discussing computation, let’s start with some assertions:

  1. Every computational system is equally fast at doing nothing.
  2. Computational systems were created to do things.
  3. A computational system is better at doing things when it is doing something than when it is doing nothing.

Modern computational systems have many streams of execution:

  • Often, very large systems are created by literally wiring together smaller systems. At IBM, these smaller systems are sometimes called CECs (short for Central Electronics Complexes and pronounced "keks").
  • There are multiple sockets for processor modules in each system.
  • There are sometimes multiple chips per socket (in the form of dual-chip modules, or DCMs, and multi-chip modules, or MCMs).
  • There are multiple cores per chip.
  • There are multiple threads per core.

In sum, there are potentially thousands of execution threads across a single computational system.

Ideally, all these execution streams are 100% busy doing useful work. One measure of utilization for an individual execution stream (CPU thread) is the percentage of time that thread has tasks scheduled and running. (Note that I didn't say "doing useful work." Creating a tool that measures useful work is left as an exercise for the reader.) By extension, system utilization is the overall percentage of time that all execution streams of a system have tasks scheduled and running. Similarly, utilization can be defined with respect to an individual task. Task utilization is the percentage of the lifetime of the task that was spent actively running on any CPU thread. By extension, process utilization is the collective utilization of its tasks.
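
As a small illustration of these definitions, utilization reduces to busy time divided by elapsed time. The following sketch is purely illustrative; the function name and values are not from curt or perf:

# Minimal sketch: utilization as busy time over elapsed time.
# All names and numbers here are illustrative.

def utilization_percent(busy_ns, start_ns, end_ns):
    # percentage of the interval [start_ns, end_ns) spent busy
    elapsed_ns = end_ns - start_ns
    if elapsed_ns <= 0:
        return 0.0
    return 100.0 * busy_ns / elapsed_ns

# a task busy for 2 ms of a 10 ms trace is 20% utilized:
print(utilization_percent(2000000, 0, 10000000))   # 20.0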

There are tools that measure system utilization: uptime, vmstat, mpstat, nmon, etc. There are tools that measure individual process utilization: time. There are not many tools that measure system-wide per-process and per-task utilization. One such command is curt on AIX. According to IBM's Knowledge Center: "The curt command takes an AIX trace file as input and produces a number of statistics related to processor (CPU) utilization and process/thread/pthread activity."

The AIX curt command reports system-wide, per-processor, per-process, and per-task statistics for application processing (user time), system calls (system time), hypervisor calls, kernel threads, interrupts, and idle time.

This seems like a good model for a similar command for a Linux system.

Utilization data

Before starting to create any tool for utilization analysis, it is important to know what data is required. Since utilization is directly related to whether a task is actively running or not, related scheduling events are required: When is the task made to run, and when is it paused? Tracking on which CPU a task runs is important, so migration events are required for implicit migrations. There are also certain system calls that force explicit migrations. Creation and deletion of tasks are obviously important. Since we want to understand user time, system time, hypervisor time, and interrupt time, events that show the transitions between those task states are required.

The Linux kernel contains "tracepoints" for all those events. It is possible to enable tracing for those events directly in the kernel's debugfs filesystem, usually mounted at /sys/kernel/debug, in the tracing directory (/sys/kernel/debug/tracing).

An easier way to record tracing data is with the Linux perf command.

The perf command

perf is a very powerful userspace command for tracing or counting both hardware and software events.

Software events are predefined in the kernel, can be predefined in userspace code, and can be dynamically created (as "probes") in kernel or userspace code.

perf can do much more than just trace and count, though.

perf stat

The stat subcommand of perf will run a command, count some events commonly found interesting, and produce a simple report:

Performance counter stats for './load 100000':
 
      90537.006424      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
               915      page-faults:u             #    0.010 K/sec                  
   386,836,206,133      cycles:u                  #    4.273 GHz                      (66.67%)
     3,488,523,420      stalled-cycles-frontend:u #    0.90% frontend cycles idle     (50.00%)
   287,222,191,827      stalled-cycles-backend:u  #   74.25% backend cycles idle      (50.00%)
   291,102,378,513      instructions:u            #    0.75  insn per cycle        
                                                  #    0.99  stalled cycles per insn  (66.67%)
    43,730,320,236      branches:u                #  483.010 M/sec                    (50.00%)
       822,030,340      branch-misses:u           #    1.88% of all branches          (50.00%)
 
      90.539972837 seconds time elapsed

perf record, perf report, and perf annotate

For much more interesting analysis, the perf command can also be used to record events and information associated with the task state at the time the event occurred:

$ perf record ./some-command
[ perf record: Woken up 55 times to write data ]
[ perf record: Captured and wrote 13.973 MB perf.data (366158 samples) ]
$ perf report --stdio --show-nr-samples --percent-limit 4
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Command  Shared Object      Symbol                                          
# ........  ............  .......  .................  ................................................
#
    62.31%        228162  load     load               [.] main
    19.29%         70607  load     load               [.] sum_add
    18.33%         67117  load     load               [.] sum_sub

This example shows a program that spends about 60% of its running time in the function main and about 20% each in subfunctions sum_sub and sum_add. Note that the default event used by perf record is "cycles." Later examples will show how to use perf record with other events.

perf report can further report runtime statistics by source code line (if the compilation was performed with the -g flag to produce debug information):

$ perf report --stdio --show-nr-samples --percent-limit 4 --sort=srcline
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Source:Line                        
# ........  ............  ...................................
#
    19.40%         71031  load.c:58
    16.16%         59168  load.c:18
    15.11%         55319  load.c:14
    13.30%         48690  load.c:66
    13.23%         48434  load.c:70
     4.58%         16767  load.c:62
     4.01%         14677  load.c:56

Further, perf annotate can show statistics for each instruction of the program:

$ perf annotate --stdio
Percent |      Source code & Disassembly of load for cycles:u (70607 samples)
------------------------------------------------------------------------------
         :      0000000010000774 <sum_add>:
         :      int sum_add(int sum, int value)
    6.09 :        100007a0:   mr      r3,r9
    0.02 :        100007a4:   addi    r1,r31,64
    0.03 :        100007a8:   ld      r31,-8(r1)
    0.00 :        100007ac:   blr

(Note: this code isn’t optimized.)

perf top

Similar to the top command, which displays (at a regular update interval) the processes using the most CPU time, perf top will display the functions using the most CPU time among all processes on the system, a nice leap in granularity.

perf list

The examples thus far have used the default event, cycles. There are hundreds and perhaps thousands of events of different types. perf list will show them all. Following are just a few examples:

$ perf list
  instructions                                       [Hardware event]
  context-switches OR cs                             [Software event]
  L1-icache-loads                                    [Hardware cache event]
  mem_access OR cpu/mem_access/                      [Kernel PMU event]
cache:
  pm_data_from_l2                                  
       [The processor's data cache was reloaded from local core's L2 due to a demand load]
floating point:
  pm_fxu_busy                                      
       [fxu0 busy and fxu1 busy]
frontend:
  pm_br_mpred_cmpl                                  
       [Number of Branch Mispredicts]
memory:
  pm_data_from_dmem                                
       [The processor's data cache was reloaded from another chip's memory on the same Node or Group (Distant) due to a demand load]
  pm_data_from_lmem                                
       [The processor's data cache was reloaded from the local chip's Memory due to a demand load]
  rNNN                                               [Raw hardware event descriptor]
  raw_syscalls:sys_enter                             [Tracepoint event]
  syscalls:sys_enter_chmod                           [Tracepoint event]
  sdt_libpthread:pthread_create                      [SDT event]

Events labeled as Hardware event, Hardware cache event, Kernel PMU event, and most (if not all) of the events under categories like cache, floating point, frontend, and memory are hardware events counted by the hardware and triggered each time a certain count is reached. Once triggered, an entry is made into the kernel trace buffer with the current state of the associated task. Raw hardware event codes are alphanumeric encodings of the hardware events. These are mostly needed when the hardware is newer than the kernel and the user needs to enable events that are new for that hardware. Users will rarely, if ever, need to use raw event codes.

Events labeled Tracepoint event are embedded in the kernel. These are triggered when that section of code is executed by the kernel. There are "syscalls" events for every system call supported by the kernel. raw_syscalls events are triggered for every system call. Since there is a limit to the number of events that can be actively traced, the raw_syscalls events may be more practical when a large number of system calls need to be traced.

Events labeled SDT event are for software-defined tracepoints (SDTs). These can be embedded in application or library code and enabled as needed. When enabled, they behave just like other events: When that section of code is executed (by any task being traced on the system), an entry is made in the kernel trace buffer with the current state of the associated task. This is a very powerful capability that can prove very useful.

perf buildid-cache and perf probe

Enabling SDTs is easy. First, make the SDTs for a certain library known to perf:

$ perf buildid-cache -v --add /lib/powerpc64le-linux-gnu/libpthread.so.0
$ perf list | grep libpthread
[…]
  sdt_libpthread:pthread_create                      [SDT event]
[…]

Then, turn SDT definitions into available tracepoints:

$ /usr/bin/sudo perf probe sdt_libpthread:pthread_create
Added new event:
  sdt_libpthread:pthread_create (on %pthread_create in /lib/powerpc64le-linux-gnu/libpthread-2.27.so)
You can now use it in all perf tools, such as:
    perf record -e sdt_libpthread:pthread_create -aR sleep 1
$ perf record -a -e sdt_libpthread:pthread_create ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.199 MB perf.data (9 samples) ]

Note that any location in an application or library can be made into a tracepoint. To find functions in an application that can be made into tracepoints, use perf probe with --funcs:

$ perf probe -x ./load --funcs
[…]
main
sum_add
sum_sub

To enable the function main of the ./load application as a tracepoint:

$ /usr/bin/sudo perf probe -x ./load main
Added new event:
  probe_load:main      (on main in /home/pc/projects/load-2.1pc/load)
You can now use it in all perf tools, such as:
    perf record -e probe_load:main -aR sleep 1
$ perf list | grep load:main
  probe_load:main                                     [Tracepoint event]
$ perf record -e probe_load:main ./load
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB perf.data (1 samples) ]

perf script

Continuing the previous example, perf script can be used to walk through the perf.data file and output the contents of each record:

$ perf script
            Load 16356 [004] 80526.760310: probe_load:main: (4006a2)

Processing perf trace data

The preceding discussion and examples show that perf can collect the data required for system utilization analysis. However, how can that data be processed to produce the desired results?

perf eBPF

A relatively new and emerging technology with perf is called eBPF. BPF is an acronym for Berkeley Packet Filter, and it is a C-like language originally for, not surprisingly, network packet filtering in the kernel. eBPF is an acronym for extended BPF, a similar, but more robust C-like language based on BPF.

Recent versions of perf can be used to incorporate compiled eBPF code into the kernel to securely and intelligently handle events for any number of purposes, with some limitations.

The capability is very powerful and quite useful for real-time, continuous updates of event-related data and statistics.

However, as this capability is emerging, support is mixed on current releases of Linux distributions. It's a bit complicated (or, put differently, I have not figured it out yet). It's also only for online use; there is no offline capability. For these reasons, I won't cover it further here.

perf data file

perf report produces a perf.information file. The file is a structured binary file, isn’t significantly nicely documented, has no programming interface for entry, and is unclear on what compatibility ensures exist. For these causes, I selected to not straight use the perf.information file.

perf script

One of the last examples above showed how perf script is used for walking through the perf.data file and emitting basic information about each record there. This is an appropriate model for what would be needed to process the file, track the state changes, and compute the statistics required for system utilization analysis.

perf script has several modes of operation, including several higher-level scripts that come with perf and produce statistics based on the trace data in a perf.data file.

$ perf script -l
List of available trace scripts:
  rw-by-pid                            system-wide r/w activity
  rwtop [interval]                     system-wide r/w top
  wakeup-latency                       system-wide min/max/avg wakeup latency
  failed-syscalls [comm]               system-wide failed syscalls
  rw-by-file <comm>                    r/w activity for a program, by file
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid
  intel-pt-events                      print Intel PT Power Events and PTWRITE
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  export-to-sqlite [database name] [columns] [calls] export perf data to a sqlite3 database
  futex-contention                     futex contention measurement
  sctop [comm] [interval]              syscall top
  event_analyzing_sample               analyze all perf samples
  net_dropmonitor                      display a table of dropped frames
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex] display time taken by mm compaction
  export-to-postgresql [database name] [columns] [calls] export perf data to a postgresql database
  stackcollapse                        produce callgraphs in short form for scripting use
  netdev-times [tx] [rx] [dev=] [debug] display a process of packet and processing time
  syscall-counts [comm]                system-wide syscall counts
  sched-migration                      sched migration overview
$ perf script failed-syscalls-by-pid /bin/ls
 
syscall errors:
 
comm [pid]                           count
------------------------------  ----------
 
ls [18683]
  syscall: access          
    err = ENOENT                         1
  syscall: statfs          
    err = ENOENT                         1
  syscall: ioctl          
    err = ENOTTY                         3

What do these scripts look like? Let's find out.

$ locate failed-syscalls-by-pid
/usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
[…]
$ rpm -qf /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
perf-4.14.0-46.el7a.x86_64
$ ls /usr/libexec/perf-core/scripts
perl  python
$ perf script -s lang
 
Scripting language extensions (used in perf script -s [spec:]script.[spec]):
 
  Perl                                       [Perl]
  pl                                         [Perl]
  Python                                     [Python]
  py                                         [Python]

So, these scripts come with perf, and both Python and Perl are supported languages.

Note that for the entirety of this content, I will refer exclusively to Python.

perf scripts

How do these scripts do what they do? Here are important extracts from /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py:

def raw_syscalls__sys_exit(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, ret):
[…]
        if ret < 0:
[…]
                        syscalls[common_comm][common_pid][id][ret] += 1

The function raw_syscalls__sys_exit has parameters for all the data for the associated event. The rest of the function only increments a counter keyed by the command, process ID, and system call; the sketch below shows the idea. The rest of the code doesn't do that much. Most of the complexity is in the function signature for the event-handling routine.
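
The nested counter (command → process ID → syscall → error code) can be sketched in plain Python with defaultdict; this is an illustrative stand-in for the helper the bundled script uses, not the script's actual code:

from collections import defaultdict

# illustrative stand-in for the nested counter the script increments:
#   syscalls[common_comm][common_pid][id][ret] += 1
syscalls = defaultdict(                  # command
    lambda: defaultdict(                 # process ID
        lambda: defaultdict(             # system call number
            lambda: defaultdict(int))))  # error code -> count

# e.g., process 18683 running "ls" got ENOENT (-2) from syscall 4 (stat on x86_64):
syscalls["ls"][18683][4][-2] += 1
print(syscalls["ls"][18683][4][-2])   # 1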

Fortunately, perf makes it easy to determine the proper signatures for various tracepoint event-handling functions.

perf script --gen-script

For the raw_syscalls events, we can generate a trace containing just those events:

$ perf list | grep raw_syscalls
  raw_syscalls:sys_enter                             [Tracepoint event]
  raw_syscalls:sys_exit                              [Tracepoint event]
$ perf record -e 'raw_syscalls:*' /bin/ls >/dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (176 samples) ]

We can then have perf generate a script that contains sample implementations of event-handling functions for the events in the perf.data file:

$ perf script --gen-script python
generated Python script: perf-script.py

What do we find in the script?

def raw_syscalls__sys_exit(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, ret):
[…]
def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args):

Both event-handling functions are specified with their signatures. Nice!

Note that this script works with perf script -s:

$ perf script -s ./perf-script.py
in trace_begin
raw_syscalls__sys_exit     7 94571.445908134    21117 ls                    id=0, ret=0
raw_syscalls__sys_enter     7 94571.445942946    21117 ls                    id=45, args=���?bc���?�
[…]

Now we have a template on which to base writing a Python script to parse the events of interest for reporting system utilization.

perf scripting

The Python scripts generated by perf script --gen-script are not directly executable. They must be invoked by perf:

$ perf script -s ./perf-script.py

What's really happening here?

  1. First, perf starts. The script subcommand's -s option indicates that an external script will be used.
  2. perf establishes a Python runtime environment.
  3. perf loads the specified script.
  4. perf runs the script. The script can perform normal initialization and even handle command line arguments, although passing the arguments is slightly awkward, requiring a -- separator between the arguments for perf and for the script:
    $ perf script -s ./perf-script.py -- --script-arg1 [...]
  5. perf processes each record of the trace file, calling the appropriate event-handling function in the script. Those event-handling functions can do whatever they need to do. (A minimal skeleton of such a script follows this list.)
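
Putting those steps together, a minimal script that perf can drive might look like the sketch below. trace_begin and trace_end are optional hooks perf calls before the first and after the last record; the handler name must match the traced event:

# minimal-script.py: a sketch to run as `perf script -s ./minimal-script.py`
# against a perf.data file that contains raw_syscalls events

def trace_begin():
        print("trace started")

def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args):
        # print one line per system call entry
        print("cpu %d pid %d (%s) entered syscall %d" %
                (common_cpu, common_pid, common_comm, id))

def trace_end():
        print("trace ended")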

Utilization

It appears that perf scripting has sufficient capabilities for a workable solution. What sort of information is required to generate the statistics for system utilization?

  • Task creation (fork, pthread_create)
  • Task termination (exit)
  • Task replacement (exec)
  • Task migration, explicit or implicit, and current CPU
  • Task scheduling
  • System calls
  • Hypervisor calls
  • Interrupts

It can be helpful to know what portion of time a task spends in various system calls, handling interrupts, or making explicit calls out to the hypervisor. Each of these categories of time can be considered a "state" for the task, and the methods of transitioning from one state to another need to be tracked.

The most important point is that there are events for each state transition:

  • Task creation: clone system call
  • Task termination: sched:sched_process_exit
  • Task replacement: sched:sched_process_exec
  • Task migration: sched_setaffinity system call (explicit), sched:sched_migrate_task (implicit)
  • Task scheduling: sched:sched_switch
  • System calls: raw_syscalls:sys_enter, raw_syscalls:sys_exit
  • Hypervisor calls: (POWER-specific) powerpc:hcall_entry, powerpc:hcall_exit
  • Interrupts: irq:irq_handler_entry, irq:irq_handler_exit

The curt command for Linux

perf provides a suitable infrastructure with which to capture the necessary data for system utilization. A sufficient set of events is available for tracing in the Linux kernel. The Python scripting capabilities enable a powerful and flexible means of processing the trace data. It's time to write the tool.

High-level design

In processing each event, the relevant state of the affected tasks must be updated:

  • New task? Create and initialize data structures to track the task's state
    • Command
    • Process ID
    • Task ID
    • Migration count (0)
    • Current CPU
  • New CPU for this task? Create and initialize data structures for CPU-specific data
    • User time (0)
    • System time (0)
    • Hypervisor time (0)
    • Interrupt time (0)
    • Idle time (0)
  • New transaction for this task? Create and initialize data structures for transaction-specific data
    • Elapsed time (0)
    • Count (0)
    • Minimum (maxint), maximum (0)
  • Existing task?
    • Accumulate time for the previous state
    • Transaction ending? Accumulate time for the transaction; adjust minimum, maximum values
  • Set new state
  • Save current time (time current state entered)
  • Migration? Increment migration count

High-level example

For a raw_syscalls:sys_enter event:

  • If this task has not been seen before, allocate and initialize a new task data structure
  • If the CPU is new for this task, allocate and initialize a new CPU data structure
  • If this system call is new for this task, allocate and initialize a new call data structure
  • In the task data structure:
    • Accumulate the time since the last state change in a bucket for the current state ("user")
    • Set the new state ("system")
    • Save the current timestamp as the start of this time interval for the new state

Edge cases

sys_exit as a task's first event

If the first event in the trace for a task is raw_syscalls:sys_exit:

  • There is no matching raw_syscalls:sys_enter with which to determine the start time of this system call.
  • The accumulated time since the start of the trace was all spent in the system call and needs to be added to the overall elapsed time spent in all calls to this system call.
  • The elapsed time of this system call is unknown.
  • It would be inaccurate to account for this elapsed time in the average, minimum, or maximum statistics for this system call.

In this case, the tool creates a separate bucket called "pending" for time spent in the system call that cannot be accounted for in the average, minimum, or maximum.

A "pending" bucket is required for all transactional events (system calls, hypervisor calls, and interrupts).

sys_enter as a task's last event

Similarly, if the last event in the trace for a task is raw_syscalls:sys_enter:

  • There is no matching raw_syscalls:sys_exit with which to determine the end time of this system call.
  • The accumulated time from the start of the system call to the end of the trace was all spent in the system call and needs to be added to the overall elapsed time spent in all calls to this system call.
  • The elapsed time of this system call is unknown.
  • It would be inaccurate to account for this elapsed time in the average, minimum, or maximum statistics for this system call.

This elapsed time is also accumulated in the "pending" bucket.

Since this situation can only be discovered at the end of the trace, a final "wrap-up" step is required in the tool, where the statistics for all known tasks are completed based on their final states.

Indeterminable state

It is possible that a very busy task (or a short trace) will never see an event from which the task's state can be determined. For example, if only sched:sched_switch or sched:sched_task_migrate events are seen for a task, it is impossible to determine that task's state. However, the task is known to exist and to be running.

Since the actual state cannot be determined, the runtime for the task is accumulated in a separate bucket, arbitrarily called "busy-unknown." For completeness, this time is also displayed in the final report.

Invisible tasks

For very, very busy tasks (or a short trace), it is possible that a task was actively running during the entire time the trace was being collected, but no events for that task appear in the trace. It was never migrated, paused, or forced to wait.

Such tasks cannot be known to exist by the tool and will not appear in the report.

curt.py Python classes

Task

  • One per task
  • Holds all task-specific data (command, process ID, state, CPU, list of CPU data structures [see below], migration count, lists of per-call data structures [see below])
  • Maintains task state

Call

  • One per unique transaction, per task (for example, one for the "open" system call, one for the "close" system call, one for IRQ 27, etc.)
  • Holds call-specific data (e.g., start timestamp, count, elapsed time, minimum, maximum)
  • Allocated as needed (lazy allocation)
  • Stored within a task in a Python dictionary indexed by the unique identifier of the call (e.g., system call code, IRQ number, etc.)

CPU

  • One per CPU on which this task has been observed to be running
  • Holds per-CPU task data (e.g., user time, system time, hypervisor call time, interrupt time)
  • Allocated as needed (lazy allocation)
  • Stored within a task in a Python dictionary indexed by the CPU number
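
A minimal sketch of these three classes, following the descriptions above (field names are illustrative; the real curt.py carries more state):

import sys

class Task(object):
        # per-task state: command, mode, CPUs seen, and per-call statistics
        def __init__(self, timestamp=0, comm='unknown', mode='unknown', pid='unknown'):
                self.comm = comm
                self.pid = pid
                self.mode = mode            # 'user', 'sys', 'busy-unknown', ...
                self.timestamp = timestamp  # when the current mode was entered
                self.cpu = 'unknown'
                self.cpus = {}              # CPU number -> CPU, allocated lazily
                self.migrations = 0
                self.syscalls = {}          # syscall id -> Call, allocated lazily

class Call(object):
        # per-transaction statistics (one per syscall id, IRQ number, etc.)
        def __init__(self):
                self.timestamp = 0
                self.count = 0
                self.elapsed = 0
                self.pending = 0
                self.min = sys.maxsize      # so the first sample becomes the minimum
                self.max = 0

class CPU(object):
        # per-CPU time buckets for one task
        def __init__(self):
                self.user = 0
                self.system = 0
                self.hv = 0
                self.irq = 0
                self.idle = 0
                self.busy_unknown = 0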

curt.py event processing example

As previously discussed, perf script will iterate over all events in the trace and call the appropriate event-handling function for each event.

A first attempt at event-handling functions for sys_enter and sys_exit, given the high-level example above, might be:

tasks = {}

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
 
  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  if common_cpu not in task.cpus:
    # new CPU!
    task.cpu = common_cpu
    task.cpus[common_cpu] = CPU()

  # compute time spent in the previous state ('user')
  delta = timestamp - task.timestamp
  # accumulate 'user' time for this task/CPU
  task.cpus[task.cpu].user += delta
  if id not in task.syscalls:
    # new system call for this task!
    task.syscalls[id] = Call()

  # change task's state
  task.mode = 'sys'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # get the task data structure
  task = tasks[common_pid]

  # compute elapsed time for this system call
  delta = timestamp - task.timestamp

  # accumulate time for this task/system call
  task.syscalls[id].elapsed += delta
  # increment the tally for this task/system call
  task.syscalls[id].count += 1
  # adjust statistics
  if delta < task.syscalls[id].min:
    task.syscalls[id].min = delta
  if delta > task.syscalls[id].max:
    task.syscalls[id].max = delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

Handling the edge cases

Following are some of the edge cases that are possible and must be handled.

Sys_exit as first event

As a system-wide trace can be started at an arbitrary time, it is certainly possible that the first event for a task is raw_syscalls:sys_exit. This requires adding the same code for new task discovery from the event-handling function for raw_syscalls:sys_enter to the handler for raw_syscalls:sys_exit. This:

  # get the task data structure
  task = tasks[common_pid]

becomes this:

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

Another issue is that it is impossible to properly accumulate the data for this system call, since there is no timestamp for the start of the system call. The time from the start of the trace until this event has been spent by this task in the system call. It would be inaccurate to ignore this time. It would also be inaccurate to incorporate this time such that it is used to compute the average, minimum, or maximum. The only reasonable option is to accumulate this separately, calling it "pending" system time. To accurately compute this time, the timestamp of the first event of the trace must be known. Since any event could be the first event in the trace, every event must conditionally save its timestamp if it is the first event. A global variable is required:

start_timestamp = 0

And every event-handling function must conditionally save its timestamp:

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

So, the event-handling function for raw_syscalls:sys_exit becomes:

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta = timestamp - task.timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    # new CPU and new system call for this task
    task.cpu = common_cpu
    task.cpus[common_cpu] = CPU()
    task.syscalls[id] = Call()

    # compute the time since the start of the trace, spent in this system call
    delta = timestamp - start_timestamp

    # accumulate "pending" time for this task/system call
    task.syscalls[id].pending += delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

Sys_enter as last event

A similar issue to having sys_exit as the first event for a task is when sys_enter is the last event seen for a task. The time spent in the system call must be accumulated for completeness but cannot accurately impact the average, minimum, or maximum. This time is also accumulated in a separate "pending" state.

To accurately determine the elapsed time of the pending system call, from sys_enter to the end of the trace period, the timestamp of the final event in the trace file is required. Unfortunately, there is no way to know which event is the last event until that event has already been processed. So, all events must save their respective timestamps in a global variable.

It may be that many tasks are in a state where the last event seen for them was sys_enter. Thus, after the last event is processed, a final "wrap up" step is required to complete the statistics for those tasks. Fortunately, there is a trace_end function which is called by perf after the final event has been processed.

Last, we need to save the id of the system call in every sys_enter.

curr_timestamp = 0

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):

  global curr_timestamp

  # convert the multiple timestamp values into a single value
  curr_timestamp = nsecs(common_secs, common_nsecs)
[…]
  task.syscall = id
[…]

def trace_end():
        for tid in tasks.keys():
                task = tasks[tid]
                # if this task ended while executing a system call
                if task.mode == 'sys':
                        # compute the time from the entry to the system call to the end of the trace period
                        delta = curr_timestamp - task.timestamp
                        # accumulate the elapsed time for this system call
                        task.syscalls[task.syscall].pending += delta
                        # accumulate the system time for this task/CPU
                        task.cpus[task.cpu].system += delta

Migrations

A task migration is when a task running on one CPU is moved to another CPU. This can happen by either:

  1. Explicit request (e.g., a call to sched_setaffinity; see the short example after this list), or
  2. Implicitly by the kernel (e.g., load balancing or vacating a CPU being taken offline)
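
For instance, an explicit migration can be requested from Python itself (Python 3 on Linux); os.sched_setaffinity issues the sched_setaffinity system call that curt watches for (the CPU number below is arbitrary):

import os

# pin the calling process (pid 0 means "self") to CPU 2,
# issuing the sched_setaffinity system call under the covers
os.sched_setaffinity(0, {2})
print(os.sched_getaffinity(0))   # {2}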

When detected:

  • The migration count for the task should be incremented
  • The statistics for the previous CPU should be updated
  • A new CPU data structure may need to be updated and initialized if the CPU is new for the task
  • The task's current CPU is set to the new CPU

For accurate statistics, task migrations must be detected as soon as possible. The first case, explicit request, happens within a system call and can be detected in the sys_exit event for that system call. The second case has its own event, sched:sched_migrate_task, so it will need a new event-handling function.

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta = timestamp - task.timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu
    task.cpus[common_cpu] = CPU()
    task.syscalls[id] = Call()

    # compute the time since the start of the trace, spent in this system call
    delta = timestamp - start_timestamp

    # accumulate "pending" time for this task/system call
    task.syscalls[id].pending += delta

  if common_cpu != task.cpu:
    task.migrations += 1
    # divide the time spent in this syscall in half...
    delta /= 2
    # and give half to the previous CPU, below, and half to the new CPU, later
    task.cpus[task.cpu].system += delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].system += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def sched__sched_migrate_task(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, comm, pid, prio, orig_cpu,
        dest_cpu, perf_sample_dict):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu

    if common_cpu not in task.cpus:
      task.cpus[common_cpu] = CPU()

    task.migrations += 1

Task creation

To accurately collect statistics for a task, it is essential to know when the task is created. Tasks can be created with fork(), which creates a new process, or pthread_create(), which creates a new task within the same process. Fortunately, both are manifested by a clone system call and made evident by a sched:sched_process_fork event. The lifetime of the task starts at the sched_process_fork event. The edge case that arises is that the first likely events for the new task are:

  1. sched_switch when the new task starts running. The new task should be considered idle at creation until this event occurs
  2. sys_exit for the clone system call. The initial state of the new task needs to be based on the state of the task that creates it, including being within the clone system call.

One edge case that must be handled is that the creating task (parent) may not yet be known; in that case, it must be created and initialized, with the presumption that it has been actively running since the start of the trace.

def sched__sched_process_fork(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, parent_comm, parent_pid, child_comm, child_pid):
  global start_timestamp, curr_timestamp
  curr_timestamp = self.timestamp
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp
  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task
  try:
    parent = tasks[self.parent_tid]
  except:
    # need to create the parent task here!
    parent = Task(start_timestamp, self.command, 'sys', self.pid)
    parent.sched_stat = True # ?
    parent.cpu = self.cpu
    parent.cpus[parent.cpu] = CPU()
    tasks[self.parent_tid] = parent
 
    task.resume_mode = parent.mode
    task.syscall = parent.syscall
    task.syscalls[task.syscall] = Call()
    task.syscalls[task.syscall].timestamp = self.timestamp

Task exit

Similarly, for complete and accurate task statistics, it is essential to know when a task has terminated. There's an event for that: sched:sched_process_exit. This one is pretty easy to handle, in that the effort is just to close out the statistics and set the mode appropriately, so any end-of-trace processing will not think the task is still active:

def sched__sched_process_exit_old(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, comm, pid, prio):
  global start_timestamp, curr_timestamp
  curr_timestamp = self.timestamp
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    task.timestamp = curr_timestamp
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  delta = curr_timestamp - task.timestamp
  task.sys += delta
  task.mode = 'exit'

Output

What follows is an example of the report displayed by curt, slightly reformatted to fit on a narrower page width, with the idle-time classification data (which makes the output very wide) removed, and abbreviated for brevity. Seen are two processes, 1497 and 2857. Process 1497 has two tasks, 1497 and 1523. Each task has a per-CPU summary and a system-wide ("ALL" CPUs) summary. Each task's data is followed by the system call data for that task (if any), hypervisor call data (if any), and interrupt data (if any). After each process's respective tasks is a per-process summary. Process 2857 has a task 2857-0 that is the previous task image before an exec() system call replaced the process image. After all processes is a system-wide summary.

1497:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1497] X             2  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%
   [  1497] X           ALL  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%     0
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                 2  0.004699     0.000000     0.002350     0.002130     0.002569
     (232)epoll_wait           1  9.968375     5.865208     9.968375     9.968375     9.968375
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1523] InputThread   1  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%
   [  1523] InputThread ALL  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%     0
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                14  0.011773     0.000000     0.000841     0.000509     0.002185
     (  1)write                2  0.010763     0.000000     0.005381     0.004974     0.005789
     (232)epoll_wait           1  9.966649     5.872853     9.966649     9.966649     9.966649
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL  0.128952  0.056636  0.000000  0.000000  0.000000 31.643684 |   0.6%     0
 
2857:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  2857] execs.sh      1  0.257617  0.249685  0.000000  0.000000  0.000000  0.266200 |  65.6%
   [  2857] execs.sh      2  0.000000  0.023951  0.000000  0.000000  0.000000  0.005728 |  80.7%
   [  2857] execs.sh      5  0.313509  0.062271  0.000000  0.000000  0.000000  0.344279 |  52.2%
   [  2857] execs.sh      6  0.136623  0.128883  0.000000  0.000000  0.000000  0.533263 |  33.2%
   [  2857] execs.sh      7  0.527347  0.194014  0.000000  0.000000  0.000000  0.990625 |  42.1%
   [  2857] execs.sh    ALL  1.235096  0.658804  0.000000  0.000000  0.000000  2.140095 |  46.9%     4
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  9)mmap                15  0.059388     0.000000     0.003959     0.001704     0.017919
     ( 14)rt_sigprocmask      12  0.006391     0.000000     0.000533     0.000431     0.000711
     (  2)open                 9  2.253509     0.000000     0.250390     0.008589     0.511953
     (  3)close                9  0.017771     0.000000     0.001975     0.000681     0.005245
     (  5)fstat                9  0.007911     0.000000     0.000879     0.000683     0.001182
     ( 10)mprotect             8  0.052198     0.000000     0.006525     0.003913     0.018073
     ( 13)rt_sigaction         8  0.004281     0.000000     0.000535     0.000458     0.000751
     (  0)read                 7  0.197772     0.000000     0.028253     0.000790     0.191028
     ( 12)brk                  5  0.003766     0.000000     0.000753     0.000425     0.001618
     (  8)lseek                3  0.001766     0.000000     0.000589     0.000469     0.000818
 
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [2857-0] perf          6  0.053925  0.191898  0.000000  0.000000  0.000000  0.827263 |  22.9%
   [2857-0] perf          7  0.000000  0.656423  0.000000  0.000000  0.000000  0.484107 |  57.6%
   [2857-0] perf        ALL  0.053925  0.848321  0.000000  0.000000  0.000000  1.311370 |  40.8%     1
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  0)read                 0  0.000000     0.167845           --           --           --
     ( 59)execve               0  0.000000     0.000000           --           --           --
 
ALL:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL 10.790803 29.633170  0.160165  0.000000  0.137747 54.449823 |   7.4%    50
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     (  1)write             2896  1.623985     0.000000     0.004014     0.002364     0.041399
     (102)getuid            2081  3.523861     0.000000     0.001693     0.000488     0.025157
     (142)sched_setparam     691  7.222906    32.012841     0.024925     0.002024     0.662975
     ( 13)rt_sigaction       383  0.235087     0.000000     0.000614     0.000434     0.014402
     (  8)lseek              281  0.169157     0.000000     0.000602     0.000452     0.013404
     (  0)read               133  2.782795     0.167845     0.020923     0.000509     1.864439
     (  7)poll                96  8.583354   131.889895     0.193577     0.000626     4.596280
     (  4)stat                93  7.036355     1.058719     0.183187     0.000981     3.661659
     ( 47)recvmsg             85  0.146644     0.000000     0.001725     0.000646     0.019067
     (  3)close               79  0.171046     0.000000     0.002165     0.000428     0.020659
     (  9)mmap                78  0.311233     0.000000     0.003990     0.001613     0.017919
     (186)gettid              74  0.067315     0.000000     0.000910     0.000403     0.014075
     (  2)open                71  3.081589     0.213059     0.184248     0.001921     0.937946
     (202)futex               62  5.145112   164.286154     0.405566     0.000597    11.587437
 
  -- ( ID)name             count   elapsed      pending      average      minimum      maximum
     ( 12)i8042               10  0.160165     0.000000     0.016016     0.010920     0.032805
 
Total Trace Time: 15.914636 ms

Hurdles and issues

Following are some of the issues encountered in the development of curt.

Out-of-order events

One of the more challenging issues was the discovery that events in a perf.data file can be out of time order. For a program trying to monitor state transitions carefully, this is a serious issue. For example, a trace could include the following sequence of events, displayed as they appear in the trace file:

time 0000:  sys_enter syscall1
time 0007:  sys_enter syscall2
time 0006:  sys_exit syscall1
time 0009:  sys_exit syscall2

Just blindly processing these events in the order they are presented to their respective event-handling functions (in the wrong time order) will result in incorrect statistics (or worse).

The most user-friendly ways to handle out-of-order events include:

  • Preventing traces from having out-of-order events in the first place by changing the way perf record works
  • Providing a means to reorder events in a trace file, perhaps by enhancing perf inject
  • Modifying how perf script works to present the events to the event-handling functions in time order

But user-friendly is not the same as straightforward, nor easy. Also, none of the above are in the user's control.

I chose to implement a queue for incoming events that would be sufficiently deep to allow for proper reordering of all events. This required a significant redesign of the code, including implementing classes for each event and moving the event processing for each event type into a method of that event's class.

In the redesigned code, the actual event handlers' only job is to save the relevant data from the event into an instance of the event class, queue it, then process the top (oldest in time) event from the queue:

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
        event = Event_sys_enter(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, id)
        process_event(event)

The simple reorderable queuing mechanism is in a common function:

events = []
n_events = 0
def process_event(event):
        global events,n_events,curr_timestamp
        i = n_events
        while i > 0 and events[i-1].timestamp > event.timestamp:
                i = i-1
        events.insert(i,event)
        if n_events < params.window:
                n_events = n_events+1
        else:
                event = events[0]
                # need to delete from events list now,
                # because event.process() could reenter here
                del events[0]
                if event.timestamp < curr_timestamp:
                        sys.stderr.write("Error: OUT OF ORDER events detected.\n  Try increasing the size of the look-ahead window with --window=<n>\n")
                event.process()

Note that the size of the queue is configurable, primarily for performance and to limit memory consumption. The function will report when that queue size is insufficient to eliminate out-of-order events. It is worth considering whether to treat this case as a catastrophic failure and elect to terminate the program.
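
As a design note, Python's heapq module is another way to keep such a look-ahead window ordered. A sketch of the same windowing idea (reusing the params.window setting from above; the sequence number is only a tie-breaker for equal timestamps):

import heapq

heap = []
n_queued = 0

def process_event_heap(event):
        global n_queued
        # keep the window ordered by (timestamp, arrival order)
        heapq.heappush(heap, (event.timestamp, n_queued, event))
        n_queued += 1
        if len(heap) > params.window:
                # pop and process the oldest event in the window
                _, _, oldest = heapq.heappop(heap)
                oldest.process()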

Implementing a class for each event type led to some consideration for refactoring, such that common code could coalesce into a base class:

class Event (object):
 
        def __init__(self):
                self.timestamp = 0
                self.cpu = 0
                self.tid = 0
                self.command = 'unknown'
                self.mode = 'unknown'
                self.pid = 0
 
        def process(self):
                global start_timestamp
 
                try:
                        task = tasks[self.tid]
                        if task.pid == 'unknown':
                                tasks[self.tid].pid = self.pid
                except:
                        task = Task(start_timestamp, self.command, self.mode, self.pid)
                        tasks[self.tid] = task
 
                if self.cpu not in task.cpus:
                        task.cpus[self.cpu] = CPU()
                        if task.cpu == 'unknown':
                                task.cpu = self.cpu
 
                if self.cpu != task.cpu:
                        task.cpu = self.cpu
                        task.migrations += 1
 
                return task

Then a class for each event type can be similarly constructed:

class Event_sys_enter ( Event ):
 
        def __init__(self, timestamp, cpu, tid, comm, id, pid):
                self.timestamp = timestamp
                self.cpu = cpu
                self.tid = tid
                self.command = comm
                self.id = id
                self.pid = pid
                self.mode = 'busy-unknown'
               
        def process(self):
                global start_timestamp, curr_timestamp
                curr_timestamp = self.timestamp
                if (start_timestamp == 0):
                        start_timestamp = curr_timestamp
 
                task = super(Event_sys_enter, self).process()
 
                if task.mode == 'busy-unknown':
                        task.mode = 'user'
                        for cpu in task.cpus:
                                task.cpus[cpu].user = task.cpus[cpu].busy_unknown
                                task.cpus[cpu].busy_unknown = 0
 
                task.syscall = self.id
                if self.id not in task.syscalls:
                        task.syscalls[self.id] = Call()
 
                task.syscalls[self.id].timestamp = curr_timestamp
                task.change_mode(curr_timestamp, 'sys')

Further refactoring is evident above as well: the common code that updates the relevant statistics on a task's state change, along with the state change itself, moved into a change_mode method of the Task class.
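change_mode itself is not shown in the excerpts above. A minimal sketch of the idea follows, assuming the Task keeps the timestamp of its last state change in self.timestamp and a per-mode time dictionary (time_in_mode); these are illustrative assumptions, and curt's real method updates the per-CPU statistics seen earlier:

        def change_mode(self, timestamp, new_mode):
                # charge the interval since the last state change to the
                # mode the task is leaving
                elapsed = timestamp - self.timestamp
                self.time_in_mode[self.mode] = \
                        self.time_in_mode.get(self.mode, 0) + elapsed
                # record the new mode and when it began
                self.mode = new_mode
                self.timestamp = timestamp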

Start-of-trace timestamp

As mentioned above, for scripts that depend on elapsed time, there should be an easier way to get the first timestamp in the trace than forcing every event-handling function to conditionally save its timestamp as the start-of-trace timestamp.
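One conceivable simplification, sketched here (an illustration of the idea, not what curt currently does), is to capture the first timestamp once in the common queuing function, so that each event class need not repeat the check:

start_timestamp = 0

def process_event(event):
        global start_timestamp
        # record the first timestamp seen as the start of the trace
        # (ignoring, for simplicity, events that arrive out of order)
        if start_timestamp == 0:
                start_timestamp = event.timestamp
        # ... queue and dispatch the event as shown earlier ...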

Awkward invocation

The syntax for invoking a perf Python script, including script parameters, is slightly awkward:

$ perf script -s ./curt.py -- --window=80

Also, it is awkward that perf Python scripts are not themselves executable.

The curt.py script was made directly executable and will invoke perf, which will in turn invoke the script. The implementation is a bit convoluted, but it's easy to use:

$ ./curt.py --window=80

The script must detect when it has been directly invoked. The Python environment established by perf is a virtual module from which the perf Python scripts import:

try:
        from perf_trace_context import *

If this import fails, the script was directly invoked. In this case, the script will exec perf, specifying itself as the script to run, and passing along any command line parameters:

except:
        if len(params.file_or_command) == 0:
                params.file_or_command = [ "perf.data" ]
        sys.argv = ['perf', 'script', '-i' ] + params.file_or_command + [ '-s', sys.argv[0] ]
        sys.argv.append('--')
        sys.argv += ['--window', str(params.window)]
        if params.debug:
                sys.argv.append('--debug')
        sys.argv += ['--api', str(params.api)]
        if params.debug:
                print sys.argv
        os.execvp("perf", sys.argv)
        sys.exit(1)

In this way, the script can not only be run directly, it can still be run by using the perf script command.

Simultaneous event registration required

An artifact of the way perf enables events can lead to unexpected trace data. For example, specifying:

$ perf record -a -e raw_syscalls:sys_enter -e raw_syscalls:sys_exit ./command

will result in a trace file that begins with the following sequence of events for a single task (the perf command itself):

sys_enter
sys_enter
sys_enter

This happens because perf registers the sys_enter event for every CPU on the system (because of the -a argument), and only then registers the sys_exit event for every CPU. In the latter case, since the sys_enter event has already been enabled on each CPU, that event shows up in the trace; but since sys_exit is not enabled on each CPU until after the registration call returns, the matching sys_exit calls do not show up in the trace. The reverse issue occurs at the end of the trace file, with a sequence of unmatched sys_exit events, because the sys_enter event has already been disabled.

The solution to this issue is to group the events, which is not well documented:

$ perf record -e '{raw_syscalls:sys_enter,raw_syscalls:sys_exit}' ./command

With this syntax, the sys_enter and sys_exit events are enabled simultaneously.

Awkward recording step

A lot of different events are required to compute the full set of statistics for tasks. This leads to a very long, complicated command for recording:

$ perf record -e '{...}' -a command --args

The solution to this issue is to enable the script to perform the record step itself, by invoking perf on its own. A further enhancement is to proceed after the recording is complete and report the statistics from that recording:

if params.record:
        # [ed. Omitting here the list of events for brevity]
        eventlist = '{' + eventlist + '}' # group the events
        command = ['perf', 'record', '--quiet', '--all-cpus',
                '--event', eventlist ] + params.file_or_command
        if params.debug:
                print command
        subprocess.call(command)

The command syntax required to record and report becomes:

$ ./curt.py --record ./command

Process IDs and perf API change

Process IDs are treated a bit cavalierly by perf scripting. Note well above that one of the common parameters for the generated event-handling functions is named common_pid. This is not the process ID, but the task ID. In fact, on many current Linux-based distributions, there is no way to determine a task's process ID from within a perf Python script. This presents a serious problem for a script that wants to compute statistics for a process.

Fortunately, in Linux kernel v4.14, an additional parameter, perf_sample_dict, was provided to each of the event-handling functions: a dictionary from which the process ID can be extracted (perf_sample_dict['sample']['pid']).

Unfortunately, current Linux distributions may not have that version of the Linux kernel. If the script is written to expect that extra parameter, the script will fail and report an error:

TypeError: irq__irq_handler_exit_new() takes exactly 11 arguments (10 given)

Ideally, a means to automatically discover whether the additional parameter is passed would be available, to permit a script to run easily with both the old and new APIs and to take advantage of the new API when it is available. Unfortunately, such a means is not readily apparent.

Since there is clearly value in using the new API to determine process-wide statistics, curt provides a command line option to use the new API. curt then takes advantage of Python's lazy function binding to adjust, at run time, which API to use:

if params.api == 1:
        dummy_dict = {}
        dummy_dict['sample'] = {}
        dummy_dict['sample']['pid'] = 'unknown'
        raw_syscalls__sys_enter = raw_syscalls__sys_enter_old
        […]
else:
        raw_syscalls__sys_enter = raw_syscalls__sys_enter_new
        […]

This requires two functions for each event:

def raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, perf_sample_dict):

        event = Event_sys_enter(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, id, perf_sample_dict['sample']['pid'])
        process_event(event)

def raw_syscalls__sys_enter_old(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
        global dummy_dict
        raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, dummy_dict)

Note that the event-handling function for the older API makes use of the function for the newer API, passing a statically defined dictionary containing just enough data such that accessing it as perf_sample_dict['sample']['pid'] will work (resulting in 'unknown').

Events reported on other CPUs

Not all events that refer to a task are reported from a CPU on which the task is running. This could result in an artificially high migration count and other incorrect statistics. For these types of events (sched_stat), the event CPU is ignored.
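As a sketch of the approach (the handler body here is an illustration, not curt's exact code), a sched_stat event class can bypass the base class's CPU bookkeeping so the reporting CPU is never mistaken for the running CPU:

class Event_sched_stat_runtime (Event):
        def process(self):
                # do not call the base class's process(), which would
                # treat a mismatched reporting CPU as a migration; look
                # the task up (or create it) directly instead
                try:
                        task = tasks[self.tid]
                except KeyError:
                        task = Task(start_timestamp, self.command, self.mode, self.pid)
                        tasks[self.tid] = task
                # ... accrue the runtime statistics to the task here ...
                return task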

Explicit migrations (no sched_migrate event)

While there is conveniently an event for when the kernel decides to migrate a task from one CPU to another, there is no event for when the task requests a migration on its own. Such migrations are effected by system calls (sched_setaffinity), so the sys_exit event handler must compare the event's CPU to the task's CPU and, if they differ, presume a migration has occurred. (This is described above, but repeated here in the "issues" section for completeness.)
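Given the base class shown earlier, that presumption falls out almost for free. A minimal sketch follows (curt's real Event_sys_exit handler does considerably more than this):

class Event_sys_exit (Event):
        def process(self):
                # the base class compares the event's CPU (self.cpu) with
                # the task's last-known CPU (task.cpu) and counts a
                # migration when they differ; for sys_exit this covers
                # explicit migrations via calls like sched_setaffinity
                task = super(Event_sys_exit, self).process()
                # ... close out the current system call's statistics here ...
                return task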

Mapping system call IDs to names is architecture-specific

System calls are identified in events only as unique numeric identifiers. These identifiers are not readily interpreted by humans in the report, and they are not readily mapped to their mnemonics because they are architecture-specific, and new system calls can be added in newer kernels. Fortunately, perf provides a means to map system call numeric identifiers to system call names. A simple example follows:

from Util import syscall_name
def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args, perf_sample_dict):
                print "%s id=%d" % (syscall_name(id), id)

Unfortunately, using syscall_name introduces a dependency on the audit python bindings. This dependency is being removed in upstream versions of perf.
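Until that dependency is gone, a script that wants to degrade gracefully on systems without the audit bindings could wrap the lookup in a hypothetical helper such as this (not part of curt as described here):

def safe_syscall_name(id):
        # fall back to the raw numeric identifier if the audit-based
        # lookup is unavailable or fails for this identifier
        try:
                from Util import syscall_name
                return syscall_name(id)
        except Exception:
                return str(id)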

Mapping hypervisor call IDs to names is non-existent

Similar to system calls, hypervisor calls are also identified only by numeric identifiers. For IBM's POWER hypervisor, they are statically defined. Unfortunately, perf does not provide a means to map hypervisor call identifiers to mnemonics. curt includes a (hardcoded) function to do just that:

hcall_to_name = {
        '0x4':'H_REMOVE',
        '0x8':'H_ENTER',
        '0xc':'H_READ',
        '0x10':'H_CLEAR_MOD',
[…]
}

def hcall_name(opcode):
        try:
                return hcall_to_name[hex(opcode)]
        except:
                return str(opcode)

Command strings as bytearrays

perf stores command names and string arguments in Python bytearrays. Unfortunately, printing a bytearray in Python prints every character in the bytearray, even if the string is null-terminated. For example:

$ perf record -a -e 'sched:sched_switch' sleep 3
$ perf script -g python
generated Python script: perf-script.py
$ perf script -s ./perf-script.py
in trace_begin
sched__sched_switch      3 664597.912692243    21223 perf                  prev_comm=perf^@-terminal-^@, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3^@^@^@^@^@, next_pid=23, next_prio=0
[…]

One solution is to truncate these bytearrays at the null terminator, as needed, before printing:

def null(ba):
        null = ba.find('\x00')
        if null >= 0:
                ba = ba[0:null]
        return ba

def sched__sched_switch(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, prev_comm, prev_pid, prev_prio, prev_state,
        next_comm, next_pid, next_prio, perf_sample_dict):

                print "prev_comm=%s, prev_pid=%d, prev_prio=%d, " \
                "prev_state=%s, next_comm=%s, next_pid=%d, " \
                "next_prio=%d" % \
                (null(prev_comm), prev_pid, prev_prio,
                flag_str("sched__sched_switch", "prev_state", prev_state),
                null(next_comm), next_pid, next_prio)

which nicely cleans up the output:

sched__sched_switch      3 664597.912692243    21223 perf                  prev_comm=perf, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3, next_pid=23, next_prio=0

Dynamic mappings, like IRQ number to name

Unlike system calls and hypervisor calls, interrupt numbers (IRQs) are assigned dynamically by the kernel on demand, so there can't be a static table mapping an IRQ number to a name. Fortunately, perf passes the name to the event's irq_handler_entry routine. This allows a script to create a dictionary that maps the IRQ number to a name:

irq_to_name = {}
def irq__irq_handler_entry_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, irq, name, perf_sample_dict):
        irq_to_name[irq] = name
        event = Event_irq_handler_entry(nsecs(common_secs,common_nsecs), common_cpu, common_pid, common_comm, irq, name, getpid(perf_sample_dict))
        process_event(event)

Somewhat oddly, perf does not pass the name to the irq_handler_exit routine. So, it is possible that a trace may only see an irq_handler_exit for an IRQ, and the script must be able to tolerate that. Here, instead of mapping the IRQ to a name, the IRQ number is returned as a string:

def irq_name(irq):
        if irq in irq_to_name:
                return irq_to_name[irq]
        return str(irq)

Task 0

Task 0 shows up everywhere. It's not a real task. It's a substitute for the "idle" state. It's the task ID given to the sched_switch event handler when a CPU is going to (or coming from) the "idle" state. It's often the task that is "interrupted" by interrupts. Tracking the statistics for task 0 as if it were a real task would not make sense. Currently, curt ignores task 0. However, this loses some information, like time spent in interrupt processing. curt should, but currently doesn't, track interesting (non-idle) time for task 0.
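One simple way to ignore it (a sketch of the idea; where curt actually applies the filter may differ) is an early guard where events are dispatched:

def process_event(event):
        # task (tid) 0 is the idle "task", not a real task; skip it so
        # idle time is not tracked as if it belonged to a real task
        if event.tid == 0:
                return
        # ... queue and dispatch the event as shown earlier ...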

Spurious sched_migrate_task events (same CPU)

Rarely, a sched_migrate_task event occurs in which the source and target CPUs are the same. In other words, the task is not actually migrated. To avoid artificially inflated migration counts, this case must be explicitly ignored:

class Event_sched_migrate_task (Event):
        def process(self):
[…]
                if self.cpu == self.dest_cpu:
                        return

exec

The semantics of the exec system call are that the image of the current process is replaced by a completely new process image without changing the process ID. This is awkward for tracking the statistics of a process (really, a task) based on the process (task) ID. The change is significant enough that the statistics for each task should be accumulated separately, so the current task's statistics must be closed out and a new set of statistics initialized. The challenge is that both the old and new tasks have the same process (task) ID. curt addresses this by tagging the old task's task ID with a numeric suffix:

class Event_sched_process_exec (Event):
  def process(self):
    global start_timestamp, curr_timestamp
    curr_timestamp = self.timestamp
    if (start_timestamp == 0):
      start_timestamp = curr_timestamp

    task = super(Event_sched_process_exec, self).process()

    new_task = Task(self.timestamp, self.command, task.mode, self.pid)
    new_task.sched_stat = True
    new_task.syscall = task.syscall
    new_task.syscalls[task.syscall] = Call()
    new_task.syscalls[task.syscall].timestamp = self.timestamp

    task.change_mode(curr_timestamp, 'exit')

    suffix=0
    while True:
      old_tid = str(self.tid)+"-"+str(suffix)
      if old_tid in tasks:
        suffix += 1
      else:
        break

    tasks[old_tid] = tasks[self.tid]

    del tasks[self.tid]

    tasks[self.tid] = new_task

This clearly separates the statistics for the different process images. In the example below, the perf command (task "9614-0") exec'd execs.sh (task "9614-1"), which in turn exec'd itself (task "9614"):

-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [  9614] execs.sh    4  1.328238  0.485604  0.000000  0.000000  0.000000  2.273230 |  44.4%
    [  9614] execs.sh    7  0.000000  0.201266  0.000000  0.000000  0.000000  0.003466 |  98.3%
    [  9614] execs.sh  ALL  1.328238  0.686870  0.000000  0.000000  0.000000  2.276696 |  47.0%     1

-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-0] perf        3  0.000000  0.408588  0.000000  0.000000  0.000000  2.298722 |  15.1%
    [9614-0] perf        4  0.059079  0.028269  0.000000  0.000000  0.000000  0.611355 |  12.5%
    [9614-0] perf        5  0.000000  0.067626  0.000000  0.000000  0.000000  0.004702 |  93.5%
    [9614-0] perf      ALL  0.059079  0.504483  0.000000  0.000000  0.000000  2.914779 |  16.2%     2

-- [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-1] execs.sh    3  1.207972  0.987433  0.000000  0.000000  0.000000  2.435908 |  47.4%
    [9614-1] execs.sh    4  0.000000  0.341152  0.000000  0.000000  0.000000  0.004147 |  98.8%
    [9614-1] execs.sh  ALL  1.207972  1.328585  0.000000  0.000000  0.000000  2.440055 |  51.0%     1

Distribution support

Surprisingly, there is currently no support for perf's Python bindings in Ubuntu. Follow the saga for more detail.

Limit on number of traced events

As curt gets more sophisticated, it is likely that more and more events will be required in the trace file. perf currently requires one file descriptor per event per CPU. This becomes a problem when the maximum number of open file descriptors is not a large multiple of the number of CPUs on the system. On systems with large numbers of CPUs, this quickly becomes a problem. For example, the default maximum number of open file descriptors is often 1,024. An IBM POWER8 system with four sockets may have 12 cores per socket and eight threads (CPUs) per core. Such a system has 4 * 12 * 8 = 384 CPUs. In that case, perf could trace only about two events! A workaround is to (significantly) increase the maximum number of open file descriptors (ulimit -n if the system administrator has configured the hard limits high enough; or the administrator can set the limits higher in /etc/security/limits.conf for nofile).
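For example (the values here are illustrative; the limit needed is roughly the number of traced events times the number of CPUs):

$ ulimit -n 8192

and, for a persistent change, lines like these in /etc/security/limits.conf:

*  soft  nofile  8192
*  hard  nofile  8192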

Summary

I hope this article shows the power of perf, and especially the utility and flexibility of the Python scripting enabled with perf, to perform sophisticated processing of kernel trace data. Also, it shows some of the issues and edge cases that can be encountered when the limits of such technologies are tested.

Please feel free to download and make use of the curt tool described here, report problems, suggest improvements, or contribute code of your own on the curt GitHub page.
