Modern computer systems are ever increasing in performance and capacity. This matters little if that rising capacity is not well utilized. What follows is a description of the motivation and work behind "curt," a new tool for Linux systems for measuring and breaking down system utilization by process, by task, and by CPU using the perf command's Python scripting capabilities.
I had the privilege of presenting this topic at Texas Linux Fest 2018, and here I've gone a bit deeper into the details, included links to further information, and expanded the scope of my talk.
System utilization
In discussing computation, let's begin with some assertions:
- Every computational system is equally fast at doing nothing.
- Computational systems were created to do things.
- A computational system is better at doing things when it is doing something than when it is doing nothing.
Modern computational systems have many streams of execution:
- Often, very large systems are created by literally wiring together smaller systems. At IBM, these smaller systems are sometimes called CECs (short for Central Electronics Complexes and pronounced "keks").
- There are multiple sockets for processor modules in each system.
- There are sometimes multiple chips per socket (in the form of dual-chip modules (DCMs) or multi-chip modules (MCMs)).
- There are multiple cores per chip.
- There are multiple threads per core.
In sum, there are potentially thousands of execution threads across a single computational system.
Ideally, all these execution streams are 100% busy doing useful work. One measure of utilization for an individual execution stream (CPU thread) is the percentage of time that thread has tasks scheduled and running. (Note that I didn't say "doing useful work." Creating a tool that measures useful work is left as an exercise for the reader.) By extension, system utilization is the overall percentage of time that all execution streams of a system have tasks scheduled and running. Similarly, utilization can be defined with respect to an individual task. Task utilization is the percentage of the lifetime of the task that was spent actively running on any CPU thread. By extension, process utilization is the collective utilization of its tasks.
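To make these definitions concrete, consider a minimal sketch (a hypothetical helper, not part of any tool described here) that computes task utilization from the intervals during which a task was scheduled and running:

def task_utilization(run_intervals, task_start_ns, task_end_ns):
    # run_intervals: list of (start_ns, end_ns) pairs when the task was running
    busy = sum(end - start for start, end in run_intervals)
    lifetime = task_end_ns - task_start_ns
    return 100.0 * busy / lifetime

# a task alive for 10 ms that ran for two 2 ms stretches is 40% utilized:
print(task_utilization([(0, 2000000), (5000000, 7000000)], 0, 10000000))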
There are tools that measure system utilization: uptime, vmstat, mpstat, nmon, etc. There are tools that measure individual process utilization: time. There are not many tools that measure system-wide per-process and per-task utilization. One such command is curt on AIX. According to IBM's Knowledge Center: "The curt command takes an AIX trace file as input and produces a number of statistics related to processor (CPU) utilization and process/thread/pthread activity."
The AIX curt command reports system-wide, per-processor, per-process, and per-task statistics for application processing (user time), system calls (system time), hypervisor calls, kernel threads, interrupts, and idle time.
This seems like a good model for a similar command for a Linux system.
Utilization data
Before starting to create any tools for utilization analysis, it is important to know what data is required. Since utilization is directly related to whether a task is actively running or not, related scheduling events are required: When is the task made to run, and when is it paused? Tracking on which CPU the task runs is important, so migration events are required for implicit migrations. There are also certain system calls that force explicit migrations. Creation and deletion of tasks are obviously important. Since we want to understand user time, system time, hypervisor time, and interrupt time, events that show the transitions between those task states are required.
The Linux kernel contains "tracepoints" for all these events. It is possible to enable tracing for these events directly in the kernel's debugfs filesystem, usually mounted at /sys/kernel/debug, in the tracing directory (/sys/kernel/debug/tracing).
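For example, the sched:sched_switch tracepoint can be enabled and read directly through that filesystem (a sketch; paths can vary by distribution and kernel configuration):
$ echo 1 | sudo tee /sys/kernel/debug/tracing/events/sched/sched_switch/enable
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
$ echo 0 | sudo tee /sys/kernel/debug/tracing/events/sched/sched_switch/enable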
An easier way to record tracing data is with the Linux perf command.
The perf command
perf is a very powerful userspace command for tracing or counting both hardware and software events.
Software events are predefined in the kernel, can be predefined in userspace code, and can be dynamically created (as "probes") in kernel or userspace code.
perf can do much more than just trace and count, though.
perf stat
The stat subcommand of perf will run a command, count some events commonly found interesting, and produce a simple report:
Performance counter stats for './load 100000':

      90537.006424      task-clock:u (msec)       #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 K/sec
                 0      cpu-migrations:u          #    0.000 K/sec
               915      page-faults:u             #    0.010 K/sec
   386,836,206,133      cycles:u                  #    4.273 GHz                      (66.67%)
     3,488,523,420      stalled-cycles-frontend:u #    0.90% frontend cycles idle     (50.00%)
   287,222,191,827      stalled-cycles-backend:u  #   74.25% backend cycles idle      (50.00%)
   291,102,378,513      instructions:u            #    0.75  insn per cycle
                                                  #    0.99  stalled cycles per insn  (66.67%)
    43,730,320,236      branches:u                #  483.010 M/sec                    (50.00%)
       822,030,340      branch-misses:u           #    1.88% of all branches          (50.00%)

      90.539972837 seconds time elapsed
perf record, perf report, and perf annotate
For much more interesting analysis, the perf command can also be used to record events and information associated with the task state at the time the event occurred:
$ perf record ./some-command
[ perf record: Woken up 55 times to write data ]
[ perf record: Captured and wrote 13.973 MB perf.data (366158 samples) ]
$ perf report --stdio --show-nr-samples --percent-limit 4
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Command  Shared Object      Symbol
# ........  ............  .......  .................  ................................................
#
    62.31%        228162  load     load               [.] main
    19.29%         70607  load     load               [.] sum_add
    18.33%         67117  load     load               [.] sum_sub
This example shows a program that spends about 60% of its running time in the function main and about 20% each in the subfunctions sum_sub and sum_add. Note that the default event used by perf record is "cycles." Later examples will show how to use perf record with other events.
perf report can further report runtime statistics by source code line (if the compilation was performed with the -g flag to produce debug information):
$ perf report --stdio --show-nr-samples --percent-limit 4 --sort=srcline
# Samples: 366K of event 'cycles:u'
# Event count (approx.): 388851358382
#
# Overhead       Samples  Source:Line
# ........  ............  ...................................
#
    19.40%         71031  load.c:58
    16.16%         59168  load.c:18
    15.11%         55319  load.c:14
    13.30%         48690  load.c:66
    13.23%         48434  load.c:70
     4.58%         16767  load.c:62
     4.01%         14677  load.c:56
Further, perf annotate can show statistics for each instruction of the program:
$ perf annotate --stdio
Percent |      Source code & Disassembly of load for cycles:u (70607 samples)
------------------------------------------------------------------------------
         :      0000000010000774 <sum_add>:
         :      int sum_add(int sum, int value)
    6.09 :        100007a0:   mr      r3,r9
    0.02 :        100007a4:   addi    r1,r31,64
    0.03 :        100007a8:   ld      r31,-8(r1)
    0.00 :        100007ac:   blr
(Note: this code is not optimized.)
perf top
Similar to the top command, which displays (at a regular update interval) the processes using the most CPU time, perf top will display the functions using the most CPU time among all processes on the system, a nice leap in granularity.
perf list
The examples thus far have used the default event, cycles. There are hundreds and perhaps thousands of events of different types. perf list will show them all. Following are just a few examples:
$ perf list
  instructions                                       [Hardware event]
  context-switches OR cs                             [Software event]
  L1-icache-loads                                    [Hardware cache event]
  mem_access OR cpu/mem_access/                      [Kernel PMU event]
cache:
  pm_data_from_l2
       [The processor's data cache was reloaded from local core's L2 due to a demand load]
floating point:
  pm_fxu_busy
       [fxu0 busy and fxu1 busy]
frontend:
  pm_br_mpred_cmpl
       [Number of Branch Mispredicts]
memory:
  pm_data_from_dmem
       [The processor's data cache was reloaded from another chip's memory on the same Node or Group (Distant) due to a demand load]
  pm_data_from_lmem
       [The processor's data cache was reloaded from the local chip's Memory due to a demand load]
  rNNN                                               [Raw hardware event descriptor]
  raw_syscalls:sys_enter                             [Tracepoint event]
  syscalls:sys_enter_chmod                           [Tracepoint event]
  sdt_libpthread:pthread_create                      [SDT event]
Events labeled as Hardware event, Hardware cache event, Kernel PMU event, and most (if not all) of the events under categories like cache, floating point, frontend, and memory are hardware events counted by the hardware and triggered each time a certain count is reached. Once triggered, an entry is made into the kernel trace buffer with the current state of the associated task. Raw hardware event codes are alphanumeric encodings of the hardware events. These are mostly needed when the hardware is newer than the kernel and the user needs to enable events that are new for that hardware. Users will rarely, if ever, need to use raw event codes.
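For illustration only (the specific code below is an assumption tied to particular Intel x86 hardware): raw code r003c (event 0x3c, umask 0x00) commonly counts unhalted core cycles, and it can be counted like any symbolic event:
$ perf stat -e r003c ./load 100000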
Events labeled Tracepoint event are embedded in the kernel. These are triggered when that section of code is executed by the kernel. There are "syscalls" events for every system call supported by the kernel. raw_syscalls events are triggered for every system call. Since there is a limit to the number of events being actively traced, the raw_syscalls events may be more practical when a large number of system calls need to be traced.
Events labeled SDT event are for software-defined tracepoints (SDTs). These can be embedded in application or library code and enabled as needed. When enabled, they behave just like other events: When that section of code is executed (by any task being traced on the system), an entry is made in the kernel trace buffer with the current state of the associated task. This is a very powerful capability that can prove very useful.
perf buildid-cache and perf probe
Enabling SDTs is easy. First, make the SDTs for a certain library known to perf:
$ perf buildid-cache -v --add /lib/powerpc64le-linux-gnu/libpthread.so.0
$ perf list | grep libpthread
[…]
  sdt_libpthread:pthread_create                      [SDT event]
[…]
Then, turn SDT definitions into available tracepoints:
$ /usr/bin/sudo perf probe sdt_libpthread:pthread_create
Added new event:
  sdt_libpthread:pthread_create    (on %pthread_create in /lib/powerpc64le-linux-gnu/libpthread-2.27.so)
You can now use it in all perf tools, such as:
    perf record -e sdt_libpthread:pthread_create -aR sleep 1
$ perf record -a -e sdt_libpthread:pthread_create ./test
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.199 MB perf.data (9 samples) ]
Note that any location in an application or library can be made into a tracepoint. To find functions in an application that can be made into tracepoints, use perf probe with --funcs:
$ perf probe -x ./load --funcs
[…]
main
sum_add
sum_sub
To enable the function main of the ./load application as a tracepoint:
$ /usr/bin/sudo perf probe -x ./load main
Added new event:
  probe_load:main      (on main in /home/pc/projects/load-2.1pc/load)
You can now use it in all perf tools, such as:
    perf record -e probe_load:main -aR sleep 1
$ perf list | grep load:main
  probe_load:main                                     [Tracepoint event]
$ perf record -e probe_load:main ./load
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.024 MB perf.data (1 samples) ]
perf script
Continuing the previous example, perf script can be used to walk through the perf.data file and output the contents of each record:
$ perf script
            load 16356 [004] 80526.760310: probe_load:main: (4006a2)
Processing perf trace data
The preceding discussion and examples show that perf can collect the data required for system utilization analysis. However, how can that data be processed to produce the desired results?
perf eBPF
A relatively new and emerging technology with perf is called eBPF. BPF is an acronym for Berkeley Packet Filter, and it is a C-like language originally for, not surprisingly, network packet filtering in the kernel. eBPF is an acronym for extended BPF, a similar, but more robust C-like language based on BPF.
Recent versions of perf can be used to incorporate compiled eBPF code into the kernel to safely and intelligently handle events for any number of purposes, with some limitations.
The capability is very powerful and quite useful for real-time, continuous updates of event-related data and statistics.
However, as this capability is emerging, support is mixed on current releases of Linux distributions. It's a bit complicated (or, put differently, I have not figured it out yet). It's also only for online use; there is no offline capability. For these reasons, I won't cover it further here.
perf data file
perf record produces a perf.data file. The file is a structured binary file, is not particularly well documented, has no programming interface for access, and is unclear on what compatibility guarantees exist. For these reasons, I chose not to directly use the perf.data file.
perf script
One of the last examples above showed how perf script is used for walking through the perf.data file and emitting basic information about each record there. This is an appropriate model for what would be needed to process the file, track the state changes, and compute the statistics required for system utilization analysis.
perf script has several modes of operation, including several higher-level scripts that come with perf that produce statistics based on the trace data in a perf.data file.
$ perf script -l
List of available trace scripts:
  rw-by-pid                            system-wide r/w activity
  rwtop [interval]                     system-wide r/w top
  wakeup-latency                       system-wide min/max/avg wakeup latency
  failed-syscalls [comm]               system-wide failed syscalls
  rw-by-file <comm>                    r/w activity for a program, by file
  failed-syscalls-by-pid [comm]        system-wide failed syscalls, by pid
  intel-pt-events                      print Intel PT Power Events and PTWRITE
  syscall-counts-by-pid [comm]         system-wide syscall counts, by pid
  export-to-sqlite [database name] [columns] [calls]  export perf data to a sqlite3 database
  futex-contention                     futex contention measurement
  sctop [comm] [interval]              syscall top
  event_analyzing_sample               analyze all perf samples
  net_dropmonitor                      display a table of dropped frames
  compaction-times [-h] [-u] [-p|-pv] [-t | [-m] [-fs] [-ms]] [pid|pid-range|comm-regex]  display time taken by mm compaction
  export-to-postgresql [database name] [columns] [calls]  export perf data to a postgresql database
  stackcollapse                        produce callgraphs in short form for scripting use
  netdev-times [tx] [rx] [dev=] [debug]  display a process of packet and processing time
  syscall-counts [comm]                system-wide syscall counts
  sched-migration                      sched migration overview
$ perf script failed-syscalls-by-pid /bin/ls

syscall errors:

comm [pid]                           count
------------------------------  ----------

ls [18683]
  syscall: entry
      err = ENOENT                       1
  syscall: statfs
      err = ENOENT                       1
  syscall: ioctl
      err = ENOTTY                       3
What do these scripts look like? Let's find out.
$ locate failed-syscalls-by-pid
/usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
[…]
$ rpm -qf /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py
perf-4.14.0-46.el7a.x86_64
$ ls /usr/libexec/perf-core/scripts
perl  python
$ perf script -s lang
Scripting language extensions (used in perf script -s [spec:]script.[spec]):
  Perl                                       [Perl]
  pl                                         [Perl]
  Python                                     [Python]
  py                                         [Python]
So, these scripts come with perf, and both Python and Perl are supported languages.
Note that for the entirety of this content, I will refer exclusively to Python.
perf scripts
How do these scripts do what they do? Here are important extracts from /usr/libexec/perf-core/scripts/python/failed-syscalls-by-pid.py:

def raw_syscalls__sys_exit(event_name, context, common_cpu,
	common_secs, common_nsecs, common_pid, common_comm,
	common_callchain, id, ret):
[…]
	if ret < 0:
[…]
			syscalls[common_comm][common_pid][id][ret] += 1

The function raw_syscalls__sys_exit has parameters for all the data for the associated event. The rest of the function only increments a counter associated with the command, process ID, and system call. The rest of the code doesn't do that much. Most of the complexity is in the function signature for the event-handling routine.
Fortunately, perf makes it easy to figure out the proper signatures for various tracepoint event-handling functions.
perf script --gen-script
For the raw_syscalls events, we can generate a trace containing just those events:
$ perf list | grep raw_syscalls
  raw_syscalls:sys_enter                             [Tracepoint event]
  raw_syscalls:sys_exit                              [Tracepoint event]
$ perf record -e 'raw_syscalls:*' /bin/ls >/dev/null
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.025 MB perf.data (176 samples) ]
We can then have perf generate a script that contains sample implementations of event-handling functions for the events in the perf.data file:
$ perf script --gen-script python
generated Python script: perf-script.py
What do we find in the script?
def raw_syscalls__sys_exit(event_name, context, common_cpu,
common_secs, common_nsecs, common_pid, common_comm,
common_callchain, id, ret):
[…]
def raw_syscalls__sys_enter(event_name, context, common_cpu,
common_secs, common_nsecs, common_pid, common_comm,
common_callchain, id, args):
Both event-handling functions are specified with their signatures. Nice!
Note that this script works with perf script -s:
$ perf script -s ./perf-script.py
in trace_begin
raw_syscalls__sys_exit     7 94571.445908134    21117 ls                    id=0, ret=0
raw_syscalls__sys_enter     7 94571.445942946    21117 ls                    id=45, args=���?bc���?�
[…]
Now we have a template on which to base writing a Python script to parse the events of interest for reporting system utilization.
perf scripting
The Python scripts generated by perf script --gen-script are not directly executable. They must be invoked by perf:
$ perf script -s ./perf-script.py
What's really happening here?
- First, perf starts. The script subcommand's -s option indicates that an external script will be used.
- perf establishes a Python runtime environment.
- perf loads the specified script.
- perf runs the script. The script can perform normal initialization and even handle command line arguments, although passing the arguments is slightly awkward, requiring a -- separator between the arguments for perf and for the script:
  $ perf script -s ./perf-script.py -- --script-arg1 [...]
- perf processes each record of the trace file, calling the appropriate event-handling function in the script. Those event-handling functions can do whatever they need to do (a minimal script skeleton follows the list below).
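That skeleton might look like the following (a sketch; the handler assumes the raw_syscalls:sys_enter signature generated earlier):

# minimal-script.py: run with "perf script -s ./minimal-script.py"

def trace_begin():
    # called once by perf before any events are processed
    print("in trace_begin")

def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args):
    # called once for each matching event in the perf.data file
    print("%s cpu=%d pid=%d id=%d" % (event_name, common_cpu, common_pid, id))

def trace_end():
    # called once by perf after the final event is processed
    print("in trace_end")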
Utilization
It appears that perf scripting has sufficient capabilities for a workable solution. What kind of information is required to generate the statistics for system utilization?
- Task creation (fork, pthread_create)
- Task termination (exit)
- Task replacement (exec)
- Task migration, explicit or implicit, and current CPU
- Task scheduling
- System calls
- Hypervisor calls
- Interrupts
It can be helpful to know what portion of time a task spends in various system calls, handling interrupts, or making explicit calls out to the hypervisor. Each of these categories of time can be considered a "state" for the task, and the methods of transitioning from one state to another need to be tracked.
The most important point is that there are events for each state transition (one way to tabulate them is sketched after the list below):
- Task creation: the clone system call
- Task termination: sched:sched_process_exit
- Task replacement: sched:sched_process_exec
- Task migration: the sched_setaffinity system call (explicit), sched:sched_migrate_task (implicit)
- Task scheduling: sched:sched_switch
- System calls: raw_syscalls:sys_enter, raw_syscalls:sys_exit
- Hypervisor calls (POWER-specific): powerpc:hcall_entry, powerpc:hcall_exit
- Interrupts: irq:irq_handler_entry, irq:irq_handler_exit
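As a sketch of one way to tabulate these transitions (a hypothetical structure for illustration; curt's actual bookkeeping is described below), each event can be mapped to the state the task enters when the event fires:

# hypothetical: the state a task enters at each event
# (the state being left is whatever state the task was in before)
state_entered = {
    'raw_syscalls:sys_enter': 'sys',    # user -> sys
    'raw_syscalls:sys_exit':  'user',   # sys -> user
    'powerpc:hcall_entry':    'hv',     # sys -> hv (POWER-specific)
    'powerpc:hcall_exit':     'sys',    # hv -> sys
    'irq:irq_handler_entry':  'irq',    # interrupted state -> irq
    'irq:irq_handler_exit':   'user',   # simplified; really returns to the interrupted state
    'sched:sched_switch':     'idle',   # for the task being switched out
}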
The curt command for Linux
perf provides a suitable infrastructure with which to capture the necessary data for system utilization. There is a sufficient set of events available for tracing in the Linux kernel. The Python scripting capabilities enable a powerful and flexible means of processing the trace data. It's time to write the tool.
High-level design
In processing each event, the relevant state of the affected tasks must be updated:
- New task? Create and initialize data structures to track the task's state
  - Command
  - Process ID
  - Task ID
  - Migration count (0)
  - Current CPU
- New CPU for this task? Create and initialize data structures for CPU-specific data
  - User time (0)
  - System time (0)
  - Hypervisor time (0)
  - Interrupt time (0)
  - Idle time (0)
- New transaction for this task? Create and initialize data structures for transaction-specific data
  - Elapsed time (0)
  - Count (0)
  - Minimum (maxint), maximum (0)
- Existing task?
  - Accumulate time for the previous state
  - Transaction ending? Accumulate time for the transaction; adjust minimum, maximum values
  - Set new state
  - Save current time (time current state entered)
- Migration? Increment migration count
High-level example
For a raw_syscalls:sys_enter event:
- If this task has not been seen before, allocate and initialize a new task data structure
- If the CPU is new for this task, allocate and initialize a new CPU data structure
- If this system call is new for this task, allocate and initialize a new call data structure
- In the task data structure:
  - Accumulate the time since the last state change in a bucket for the current state ("user")
  - Set the new state ("system")
  - Save the current timestamp as the start of this time period for the new state
Edge cases
sys_exit as a task's first event
If the first event in the trace for a task is raw_syscalls:sys_exit:
- There is no matching raw_syscalls:sys_enter with which to determine the start time of this system call.
- The accumulated time since the start of the trace was all spent in the system call and needs to be added to the overall elapsed time spent in all calls to this system call.
- The elapsed time of this system call is unknown.
- It would be inaccurate to account for this elapsed time in the average, minimum, or maximum statistics for this system call.
In this case, the tool creates a separate bucket called "pending" for time spent in the system call that cannot be accounted for in the average, minimum, or maximum.
A "pending" bucket is required for all transactional events (system calls, hypervisor calls, and interrupts).
sys_enter as a task's last event
Similarly, if the last event in the trace for a task is raw_syscalls:sys_enter:
- There is no matching raw_syscalls:sys_exit with which to determine the end time of this system call.
- The accumulated time from the start of the system call to the end of the trace was all spent in the system call and needs to be added to the overall elapsed time spent in all calls to this system call.
- The elapsed time of this system call is unknown.
- It would be inaccurate to account for this elapsed time in the average, minimum, or maximum statistics for this system call.
This elapsed time is also accumulated in the "pending" bucket.
Since this case can only be discovered at the end of the trace, a final "wrap-up" step is required in the tool, where the statistics for all known tasks are completed based on their final states.
Indeterminable state
It is possible that a very busy task (or a short trace) will never see an event for a task from which the task's state can be determined. For example, if only sched:sched_switch or sched:sched_migrate_task events are seen for a task, it is impossible to determine that task's state. However, the task is known to exist and to be running.
Since the actual state cannot be determined, the runtime for the task is accumulated in a separate bucket, arbitrarily called "busy-unknown." For completeness, this time is also displayed in the final report.
Invisible tasks
For very, very busy tasks (or a short trace), it is possible that a task was actively running during the entire time the trace was being collected, but no events for that task appear in the trace. It was never migrated, paused, or forced to wait.
Such tasks cannot be known to exist by the tool and will not appear in the report.
curt.py Python classes
Task
- One per task
- Holds all task-specific data (command, process ID, state, CPU, list of CPU data structures [see below], migration count, lists of per-call data structures [see below])
- Maintains task state
Call
- One per unique transaction, per task (for example, one for the "open" system call, one for the "close" system call, one for IRQ 27, etc.)
- Holds call-specific data (e.g., start timestamp, count, elapsed time, minimum, maximum)
- Allocated as needed (lazy allocation)
- Stored within a task in a Python dictionary indexed by the unique identifier of the call (e.g., system call code, IRQ number, etc.)
CPU
- One per CPU on which this task has been observed to be running
- Holds per-CPU task data (e.g., user time, system time, hypervisor call time, interrupt time)
- Allocated as needed (lazy allocation)
- Stored within a task in a Python dictionary indexed by the CPU number (a minimal sketch of all three classes follows below)
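That sketch, with field names chosen to match the event-handling code below (curt's real classes carry more detail than shown here):

import sys

class Call (object):
    # statistics for one unique transaction (system call, hypervisor call, or IRQ)
    def __init__(self):
        self.timestamp = 0      # start of the currently open transaction
        self.count = 0
        self.elapsed = 0
        self.pending = 0        # time that cannot affect average/minimum/maximum
        self.min = sys.maxint   # Python 2; use sys.maxsize on Python 3
        self.max = 0

class CPU (object):
    # per-CPU time buckets for one task
    def __init__(self):
        self.user = 0
        self.sys = 0
        self.hv = 0
        self.irq = 0
        self.idle = 0
        self.busy_unknown = 0

class Task (object):
    # one per task: state plus lazily allocated per-CPU and per-call structures
    def __init__(self):
        self.comm = 'unknown'
        self.mode = 'busy-unknown'
        self.timestamp = 0      # time the current state was entered
        self.cpu = 'unknown'
        self.migrations = 0
        self.cpus = {}          # CPU number -> CPU
        self.syscalls = {}      # call ID (e.g., syscall code) -> Call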
curt.py event processing example
As previously discussed, perf script will iterate over all events in the trace and call the appropriate event-handling function for each event.
A first attempt at event-handling functions for sys_enter and sys_exit, given the high-level example above, might be:
tasks = {}

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  if common_cpu not in task.cpus:
    # new CPU!
    task.cpu = common_cpu
    task.cpus[common_cpu] = CPU()

  # compute time spent in the previous state ('user')
  delta = timestamp - task.timestamp
  # accumulate 'user' time for this task/CPU
  task.cpus[task.cpu].user += delta
  if id not in task.syscalls:
    # new system call for this task!
    task.syscalls[id] = Call()

  # change task's state
  task.mode = 'sys'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  # get the task data structure
  task = tasks[common_pid]

  # compute elapsed time for this system call
  delta = timestamp - task.timestamp

  # accumulate time for this task/system call
  task.syscalls[id].elapsed += delta
  # increment the tally for this task/system call
  task.syscalls[id].count += 1
  # adjust statistics
  if delta < task.syscalls[id].min:
    task.syscalls[id].min = delta
  if delta > task.syscalls[id].max:
    task.syscalls[id].max = delta

  # accumulate time for this task's state on this CPU
  task.cpus[common_cpu].sys += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp
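(The nsecs() helper used above is provided with perf's Python scripting support, in Util.py; it simply combines the split seconds/nanoseconds values into a single nanosecond value, equivalent to secs * 1000000000 + nsecs.)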
Handling the edge cases
Following are some of the edge cases that are possible and must be handled.
Sys_exit as first event
Since a system-wide trace can be started at an arbitrary time, it is certainly possible that the first event for a task is raw_syscalls:sys_exit. This requires adding the same code for new task discovery from the event-handling function for raw_syscalls:sys_enter to the handler for raw_syscalls:sys_exit. This:

  # get the task data structure
  task = tasks[common_pid]

becomes this:

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task
Another issue is that it is impossible to properly accumulate the data for this system call since there is no timestamp for the start of the system call. The time from the start of the trace until this event has been spent by this task in the system call. It would be inaccurate to ignore this time. It would also be inaccurate to incorporate this time such that it is used to compute the average, minimum, or maximum. The only reasonable option is to accumulate this separately, calling it "pending" system time. To accurately compute this time, the timestamp of the first event of the trace must be known. Since any event could be the first event in the trace, every event must conditionally save its timestamp if it is the first event. A global variable is required:

start_timestamp = 0

And every event-handling function must conditionally save its timestamp:

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  global start_timestamp
  if start_timestamp == 0:
    start_timestamp = timestamp

So, the event-handling function for raw_syscalls:sys_exit becomes:
def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta = timestamp - task.timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    # compute elapsed time for this system call
    delta = timestamp - start_timestamp

    # accumulate the unaccountable time in the "pending" bucket
    task.syscalls[id] = Call()
    task.syscalls[id].pending += delta

  # accumulate time for this task's state on this CPU
  if common_cpu not in task.cpus:
    task.cpus[common_cpu] = CPU()
  task.cpus[common_cpu].sys += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp
Sys_enter as last event
A similar issue to having sys_exit as the first event for a task is when sys_enter is the last event seen for a task. The time spent in the system call must be accumulated for completeness but cannot accurately affect the average, minimum, or maximum. This time is also accumulated in the separate "pending" state.
To accurately determine the elapsed time of the pending system call, from sys_enter to the end of the trace period, the timestamp of the final event in the trace file is required. Unfortunately, there is no way to know which event is the last event until that event has already been processed. So, all events must save their respective timestamps in a global variable.
It may be that many tasks are in the state where the last event seen for them was sys_enter. Thus, after the last event is processed, a final "wrap up" step is required to complete the statistics for those tasks. Fortunately, there is a trace_end function which is called by perf after the final event has been processed.
Last, we need to save the id of the system call in every sys_enter.
curr_timestamp = 0

def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):

  global curr_timestamp

  # convert the multiple timestamp values into a single value
  curr_timestamp = nsecs(common_secs, common_nsecs)
[…]
  task.syscall = id
[…]

def trace_end():
  for tid in tasks.keys():
    task = tasks[tid]
    # if this task ended while executing a system call
    if task.mode == 'sys':
      # compute the time from the entry to the system call to the end of the trace period
      delta = curr_timestamp - task.timestamp
      # accumulate the elapsed time for this system call
      task.syscalls[task.syscall].pending += delta
      # accumulate the system time for this task/CPU
      task.cpus[task.cpu].sys += delta
Migrations
A task migration is when a task running on one CPU is moved to another CPU. This can happen by either:
- Explicit request (e.g., a call to sched_setaffinity), or
- Implicitly by the kernel (e.g., load balancing or vacating a CPU being taken offline)
When detected:
- The migration count for the task should be incremented
- The statistics for the previous CPU should be updated
- A new CPU data structure may need to be updated and initialized if the CPU is new for the task
- The task's current CPU is set to the new CPU
For accurate statistics, task migrations must be detected as soon as possible. The first case, explicit request, happens within a system call and can be detected in the sys_exit event for that system call. The second case has its own event, sched:sched_migrate_task, so it will need a new event-handling function.
def raw_syscalls__sys_exit(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, ret):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]

    # compute elapsed time for this system call
    delta = timestamp - task.timestamp

    # accumulate time for this task/system call
    task.syscalls[id].elapsed += delta
    # increment the tally for this task/system call
    task.syscalls[id].count += 1
    # adjust statistics
    if delta < task.syscalls[id].min:
      task.syscalls[id].min = delta
    if delta > task.syscalls[id].max:
      task.syscalls[id].max = delta

  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu

    # compute elapsed time for this system call
    delta = timestamp - start_timestamp

    # accumulate the unaccountable time in the "pending" bucket
    task.syscalls[id] = Call()
    task.syscalls[id].pending += delta

  if common_cpu != task.cpu:
    task.migrations += 1
    # divide the time spent in this syscall in half...
    delta /= 2
    # ...and give half to the previous CPU, here, and half to the new CPU, below
    task.cpus[task.cpu].sys += delta

  # accumulate time for this task's state on this CPU
  if common_cpu not in task.cpus:
    task.cpus[common_cpu] = CPU()
  task.cpus[common_cpu].sys += delta

  # change task's state
  task.mode = 'user'

  # save the timestamp for the last event (this one) for this task
  task.timestamp = timestamp

def sched__sched_migrate_task(event_name, context, common_cpu,
	common_secs, common_nsecs, common_pid, common_comm,
	common_callchain, comm, pid, prio, orig_cpu,
	dest_cpu, perf_sample_dict):

  global start_timestamp

  # convert the multiple timestamp values into a single value
  timestamp = nsecs(common_secs, common_nsecs)

  if start_timestamp == 0:
    start_timestamp = timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

    task.cpu = common_cpu

  if common_cpu not in task.cpus:
    task.cpus[common_cpu] = CPU()

  task.migrations += 1
Task creation
To accurately accumulate statistics for a task, it is essential to know when the task is created. Tasks can be created with fork(), which creates a new process, or pthread_create(), which creates a new task within the same process. Fortunately, both are manifested by a clone system call and made evident by a sched:sched_process_fork event. The lifetime of the task starts at the sched_process_fork event. The edge case that arises is that the first likely events for the new task are:
- sched_switch when the new task starts running. The new task should be considered idle at creation until this event occurs
- sys_exit for the clone system call. The initial state of the new task needs to be based on the state of the task that created it, including being within the clone system call.
One edge case that must be handled is when the creating task (parent) is not yet known; it must be created and initialized, and the presumption is that it has been actively running since the start of the trace.
def sched__sched_process_fork(event_name, context, common_cpu,
	common_secs, common_nsecs, common_pid, common_comm,
	common_callchain, parent_comm, parent_pid, child_comm, child_pid):

  global start_timestamp, curr_timestamp

  curr_timestamp = nsecs(common_secs, common_nsecs)
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  try:
    parent = tasks[parent_pid]
  except:
    # need to create the parent task here!
    parent = Task(start_timestamp, parent_comm, 'sys', parent_pid)
    parent.sched_stat = True # ?
    parent.cpu = common_cpu
    parent.cpus[parent.cpu] = CPU()
    tasks[parent_pid] = parent

  task.resume_mode = parent.mode
  task.syscall = parent.syscall
  task.syscalls[task.syscall] = Call()
  task.syscalls[task.syscall].timestamp = curr_timestamp
Task exit
Similarly, for complete and accurate task statistics, it is essential to know when a task has terminated. There's an event for that: sched:sched_process_exit. This one is pretty easy to handle, in that the effort is just to close out the statistics and set the mode appropriately, so any end-of-trace processing will not think the task is still active:
def sched__sched_process_exit_old(event_name, context, common_cpu,
	common_secs, common_nsecs, common_pid, common_comm,
	common_callchain, comm, pid, prio):

  global start_timestamp, curr_timestamp

  curr_timestamp = nsecs(common_secs, common_nsecs)
  if (start_timestamp == 0):
    start_timestamp = curr_timestamp

  # find this task's data structure
  try:
    task = tasks[common_pid]
  except:
    # new task!
    task = Task()
    # save the command string
    task.comm = common_comm
    task.timestamp = curr_timestamp
    # save the new task in the global list (dictionary) of tasks
    tasks[common_pid] = task

  # close out the task's current state
  delta = curr_timestamp - task.timestamp
  task.sys += delta
  task.mode = 'exit'
Output
What follows is an example of the report displayed by curt, slightly reformatted to fit on a narrower page width, with the idle-time classification data (which makes the output very wide) removed for brevity. Seen are two processes, 1497 and 2857. Process 1497 has two tasks, 1497 and 1523. Each task has a per-CPU summary and a system-wide ("ALL" CPUs) summary. Each task's data is followed by the system call data for that task (if any), hypervisor call data (if any), and interrupt data (if any). After each process's respective tasks is a per-process summary. Process 2857 has a task 2857-0 that is the previous task image before an exec() system call replaced the process image. After all processes is a system-wide summary.
1497:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1497] X             2  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%
   [  1497] X           ALL  0.076354  0.019563  0.000000  0.000000  0.000000 15.818719 |   0.6%     0

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (   0)read                 2  0.004699  0.000000  0.002350  0.002130  0.002569
     ( 232)epoll_wait           1  9.968375  5.865208  9.968375  9.968375  9.968375

-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  1523] InputThread   1  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%
   [  1523] InputThread ALL  0.052598  0.037073  0.000000  0.000000  0.000000 15.824965 |   0.6%     0

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (   0)read                14  0.011773  0.000000  0.000841  0.000509  0.002185
     (   1)write                2  0.010763  0.000000  0.005381  0.004974  0.005789
     ( 232)epoll_wait           1  9.966649  5.872853  9.966649  9.966649  9.966649

-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL  0.128952  0.056636  0.000000  0.000000  0.000000 31.643684 |   0.6%     0

2857:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [  2857] execs.sh      1  0.257617  0.249685  0.000000  0.000000  0.000000  0.266200 |  65.6%
   [  2857] execs.sh      2  0.000000  0.023951  0.000000  0.000000  0.000000  0.005728 |  80.7%
   [  2857] execs.sh      5  0.313509  0.062271  0.000000  0.000000  0.000000  0.344279 |  52.2%
   [  2857] execs.sh      6  0.136623  0.128883  0.000000  0.000000  0.000000  0.533263 |  33.2%
   [  2857] execs.sh      7  0.527347  0.194014  0.000000  0.000000  0.000000  0.990625 |  42.1%
   [  2857] execs.sh    ALL  1.235096  0.658804  0.000000  0.000000  0.000000  2.140095 |  46.9%     4

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (   9)mmap                15  0.059388  0.000000  0.003959  0.001704  0.017919
     (  14)rt_sigprocmask      12  0.006391  0.000000  0.000533  0.000431  0.000711
     (   2)open                 9  2.253509  0.000000  0.250390  0.008589  0.511953
     (   3)close                9  0.017771  0.000000  0.001975  0.000681  0.005245
     (   5)fstat                9  0.007911  0.000000  0.000879  0.000683  0.001182
     (  10)mprotect             8  0.052198  0.000000  0.006525  0.003913  0.018073
     (  13)rt_sigaction         8  0.004281  0.000000  0.000535  0.000458  0.000751
     (   0)read                 7  0.197772  0.000000  0.028253  0.000790  0.191028
     (  12)brk                  5  0.003766  0.000000  0.000753  0.000425  0.001618
     (   8)lseek                3  0.001766  0.000000  0.000589  0.000469  0.000818

-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [2857-0] perf          6  0.053925  0.191898  0.000000  0.000000  0.000000  0.827263 |  22.9%
   [2857-0] perf          7  0.000000  0.656423  0.000000  0.000000  0.000000  0.484107 |  57.6%
   [2857-0] perf        ALL  0.053925  0.848321  0.000000  0.000000  0.000000  1.311370 |  40.8%     1

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (   0)read                 0  0.000000  0.167845        --        --        --
     (  59)execve               0  0.000000  0.000000        --        --        --

ALL:
-- [  task] command     cpu      user       sys       irq        hv      busy      idle |  util% moves
   [   ALL]             ALL 10.790803 29.633170  0.160165  0.000000  0.137747 54.449823 |   7.4%    50

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (   1)write             2896  1.623985  0.000000  0.004014  0.002364  0.041399
     ( 102)getuid            2081  3.523861  0.000000  0.001693  0.000488  0.025157
     ( 142)sched_setparam     691  7.222906 32.012841  0.024925  0.002024  0.662975
     (  13)rt_sigaction       383  0.235087  0.000000  0.000614  0.000434  0.014402
     (   8)lseek              281  0.169157  0.000000  0.000602  0.000452  0.013404
     (   0)read               133  2.782795  0.167845  0.020923  0.000509  1.864439
     (   7)poll                96  8.583354 131.889895  0.193577  0.000626  4.596280
     (   4)stat                93  7.036355  1.058719  0.183187  0.000981  3.661659
     (  47)recvmsg             85  0.146644  0.000000  0.001725  0.000646  0.019067
     (   3)close               79  0.171046  0.000000  0.002165  0.000428  0.020659
     (   9)mmap                78  0.311233  0.000000  0.003990  0.001613  0.017919
     ( 186)gettid              74  0.067315  0.000000  0.000910  0.000403  0.014075
     (   2)open                71  3.081589  0.213059  0.184248  0.001921  0.937946
     ( 202)futex               62  5.145112 164.286154  0.405566  0.000597 11.587437

  -- (  ID)name             count   elapsed   pending   average   minimum   maximum
     (  12)i8042               10  0.160165  0.000000  0.016016  0.010920  0.032805

Total Trace Time: 15.914636 ms
Hurdles and issues
Following are some of the issues encountered in the development of curt.
Out-of-order events
One of the more challenging issues is the discovery that events in a perf.data file can be out of time order. For a program trying to monitor state transitions carefully, this is a serious issue. For example, a trace could include the following sequence of events, displayed as they appear in the trace file:

time 0000:  sys_enter syscall1
time 0007:  sys_enter syscall2
time 0006:  sys_exit  syscall1
time 0009:  sys_exit  syscall2

Just blindly processing these events in the order they are presented to their respective event-handling functions (in the wrong time order) will result in incorrect statistics (or worse).
The most user-friendly ways to handle out-of-order events include:
- Preventing traces from having out-of-order events in the first place by changing the way perf record works
- Providing a means to reorder events in a trace file, perhaps by enhancing perf inject
- Modifying how perf script works to present the events to the event-handling functions in time order
But user-friendly is not the same as straightforward, nor easy. Also, none of the above are in the user's control.
I chose to implement a queue for incoming events that would be sufficiently deep to allow for proper reordering of all events. This required a significant redesign of the code, including implementation of classes for each event, and moving the event processing for each event type into a method in that event's class.
In the redesigned code, the actual event handlers' only job is to save the relevant data from the event into an instance of the event class, queue it, then process the top (oldest in time) event from the queue:
def raw_syscalls__sys_enter(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
  event = Event_sys_enter(nsecs(common_secs, common_nsecs), common_cpu, common_pid, common_comm, id)
  process_event(event)
The simple reorderable queuing mechanism is in a common function:

events = []
n_events = 0

def process_event(event):
  global events, n_events, curr_timestamp

  i = n_events
  while i > 0 and events[i-1].timestamp > event.timestamp:
    i = i-1
  events.insert(i, event)

  if n_events < params.window:
    n_events = n_events + 1
  else:
    event = events[0]
    # need to delete from events list now,
    # because event.process() could reenter here
    del events[0]
    if event.timestamp < curr_timestamp:
      sys.stderr.write("Error: OUT OF ORDER events detected.\n  Try increasing the size of the look-ahead window with --window=<n>\n")
    event.process()

Note that the size of the queue is configurable, primarily for performance and to limit memory consumption. The function will report when that queue size is insufficient to eliminate out-of-order events. It is worth considering whether to treat this case as a catastrophic failure and elect to terminate the program.
Implementing a class for each event type led to some consideration for refactoring, such that common code could coalesce into a base class:

class Event (object):

  def __init__(self):
    self.timestamp = 0
    self.cpu = 0
    self.tid = 0
    self.command = 'unknown'
    self.mode = 'unknown'
    self.pid = 0

  def process(self):
    global start_timestamp

    try:
      task = tasks[self.tid]
      if task.pid == 'unknown':
        tasks[self.tid].pid = self.pid
    except:
      task = Task(start_timestamp, self.command, self.mode, self.pid)
      tasks[self.tid] = task

    if self.cpu not in task.cpus:
      task.cpus[self.cpu] = CPU()
      if task.cpu == 'unknown':
        task.cpu = self.cpu

    if self.cpu != task.cpu:
      task.cpu = self.cpu
      task.migrations += 1

    return task
Then a class for each event type can be similarly constructed:

class Event_sys_enter ( Event ):

  def __init__(self, timestamp, cpu, tid, comm, id, pid):
    self.timestamp = timestamp
    self.cpu = cpu
    self.tid = tid
    self.command = comm
    self.id = id
    self.pid = pid
    self.mode = 'busy-unknown'

  def process(self):
    global start_timestamp, curr_timestamp
    curr_timestamp = self.timestamp
    if (start_timestamp == 0):
      start_timestamp = curr_timestamp

    task = super(Event_sys_enter, self).process()

    if task.mode == 'busy-unknown':
      task.mode = 'user'
      for cpu in task.cpus:
        task.cpus[cpu].user = task.cpus[cpu].busy_unknown
        task.cpus[cpu].busy_unknown = 0

    task.syscall = self.id
    if self.id not in task.syscalls:
      task.syscalls[self.id] = Call()

    task.syscalls[self.id].timestamp = curr_timestamp
    task.change_mode(curr_timestamp, 'sys')
Further refactoring is evident above, as well, moving the common code that updates relevant statistics based on a task's state change, and the state change itself, into a change_mode method of the Task class.
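A sketch of what such a change_mode method might look like (assuming the field names sketched earlier; curt's actual method handles more states than shown here):

  def change_mode(self, timestamp, mode):
    # accumulate the time spent in the state now being left
    delta = timestamp - self.timestamp
    if self.mode == 'user':
      self.cpus[self.cpu].user += delta
    elif self.mode == 'sys':
      self.cpus[self.cpu].sys += delta
    elif self.mode == 'busy-unknown':
      self.cpus[self.cpu].busy_unknown += delta
    # enter the new state
    self.mode = mode
    self.timestamp = timestamp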
Start-of-trace timestamp
As mentioned above, for scripts that depend on elapsed time, there should be an easier way to get the first timestamp in the trace other than forcing every event-handling function to conditionally save its timestamp as the start-of-trace timestamp.
Awkward invocation
The syntax for invoking a perf Python script, including script parameters, is slightly awkward:
$ perf script -s ./curt.py -- --window=80
Also, it's awkward that perf Python scripts are not themselves executable.
The curt.py script was made directly executable and will invoke perf, which will in turn invoke the script. The implementation is a bit convoluted, but it's easy to use:
$ ./curt.py --window=80
This script must detect when it has been directly invoked. The Python environment established by perf is a virtual module from which the perf Python scripts import:

try:
    from perf_trace_context import *

If this import fails, the script was directly invoked. In this case, the script will exec perf, specifying itself as the script to run, and passing along any command line parameters:

except:
    if len(params.file_or_command) == 0:
        params.file_or_command = [ "perf.data" ]
    sys.argv = ['perf', 'script', '-i' ] + params.file_or_command + [ '-s', sys.argv[0] ]
    sys.argv.append('--')
    sys.argv += ['--window', str(params.window)]
    if params.debug:
        sys.argv.append('--debug')
    sys.argv += ['--api', str(params.api)]
    if params.debug:
        print sys.argv
    os.execvp("perf", sys.argv)
    sys.exit(1)

In this way, the script can not only be run directly, it can still be run by using the perf script command.
Simultaneous event registration required
An artifact of the way perf enables events can lead to unexpected trace data. For example, specifying:
$ perf record -a -e raw_syscalls:sys_enter -e raw_syscalls:sys_exit ./command
will result in a trace file that begins with the following sequence of events for a single task (the perf command itself):
sys_enter
sys_enter
sys_enter
…
This happens because perf will register the sys_enter event for every CPU on the system (because of the -a argument), then it will register the sys_exit event for every CPU. In the latter case, since the sys_enter event has already been enabled for each CPU, that event shows up in the trace; but since the sys_exit has not been enabled on each CPU until after the call returns, the sys_exit call does not show up in the trace. The reverse issue happens at the end of the trace file, with a sequence of sys_exit events in the trace because the sys_enter event has already been disabled.
The solution to this issue is to group the events, which is not well documented:
$ perf record -e '{raw_syscalls:sys_enter,raw_syscalls:sys_exit}' ./command
With this syntax, the sys_enter and sys_exit events are enabled simultaneously.
Awkward recording step
There are a lot of different events required for computation of the full set of statistics for tasks. This leads to a very long, complicated command for recording:
$ perf record -e '{…}' -a command --args
The solution to this issue is to enable the script to perform the record step itself, by itself invoking perf. A further enhancement is to proceed after the recording is complete and report the statistics from that recording:

if params.record:
    # [ed. Omitting here the list of events for brevity]
    eventlist = '{' + eventlist + '}' # group the events
    command = ['perf', 'record', '--quiet', '--all-cpus',
        '--event', eventlist ] + params.file_or_command
    if params.debug:
        print command
    subprocess.call(command)

The command syntax required to record and report becomes:
$ ./curt.py --record ./command
Process IDs and perf API change
Process IDs are treated a bit cavalierly by perf scripting. Note well above that one of the common parameters for the generated event-handling functions is named common_pid. This is not the process ID, but the task ID. In fact, on many current Linux-based distributions, there is no way to determine a task's process ID from within a perf Python script. This presents a serious problem for a script that wants to compute statistics for a process.
Fortunately, in Linux kernel v4.14, an additional parameter was provided to each of the event-handling functions: perf_sample_dict, a dictionary from which the process ID can be extracted (perf_sample_dict['sample']['pid']).
Unfortunately, current Linux distributions may not have that version of the Linux kernel. If the script is written to expect that extra parameter, the script will fail and report an error:
TypeError: irq__irq_handler_exit_new() takes exactly 11 arguments (10 given)
Ideally, a means to automatically discover if the additional parameter is passed would be available to permit a script to easily run with both the old and new APIs and to take advantage of the new API if it is available. Unfortunately, such a means is not readily apparent.
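One conceivable workaround (a sketch of an alternative approach, not what curt does) is to declare the handler with a variable argument list and adapt to however many arguments perf passes:

def raw_syscalls__sys_enter(*args):
    # the new API passes one extra trailing argument, perf_sample_dict
    if len(args) == 11:
        perf_sample_dict = args[10]
        pid = perf_sample_dict['sample']['pid']
    else:
        pid = 'unknown'
    # ... handle the event using the leading ten arguments as before ...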
Since there’s clearly worth in utilizing the brand new API to find out process-wide statistics, curt
gives a command line choice to make use of the brand new API. curt
then takes benefit of Python’s lazy perform binding to regulate, at run-time, which API to make use of:
if params.api == 1:
dummy_dict =
dummy_dict['pattern'] =
dummy_dict['pattern']['pid'] = 'unknown'
raw_syscalls__sys_enter = raw_syscalls__sys_enter_old
[…]
else:
raw_syscalls__sys_enter = raw_syscalls__sys_enter_new
[…]
This requires two capabilities for every occasion:
def raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, perf_sample_dict):
    event = Event_sys_enter(nsecs(common_secs, common_nsecs), common_cpu, common_pid, common_comm, id, perf_sample_dict['sample']['pid'])
    process_event(event)

def raw_syscalls__sys_enter_old(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args):
    global dummy_dict
    raw_syscalls__sys_enter_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, id, args, dummy_dict)
Note that the event-handling function for the older API makes use of the function for the newer API, passing a statically defined dictionary containing just enough data that accessing it as perf_sample_dict['sample']['pid'] will work (resulting in 'unknown').
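As an aside, this works because the assignment simply rebinds the module-level name that perf later looks up to dispatch events, so ordinary Python name assignment is all that is needed to swap implementations. A tiny standalone illustration (not from curt):
def handler_old():
    return "old API"

def handler_new():
    return "new API"

handler = handler_old
print(handler())   # prints 'old API'
handler = handler_new
print(handler())   # prints 'new API'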
Events reported on other CPUs
Not all events that refer to a task are reported from a CPU on which the task is running. This could result in an artificially high migration count and other incorrect statistics. For these types of events (sched_stat), the event CPU is ignored.
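A minimal sketch of the idea, with hypothetical names rather than curt's actual code, charging time to the task's last known CPU instead of the reporting CPU:
# Hypothetical sketch: for sched_stat_* events, trust the CPU already
# recorded for the task, not the CPU that happened to report the event.
def handle_sched_stat(task, reporting_cpu, delta_ns):
    cpu = task.cpu if task.cpu is not None else reporting_cpu
    task.cpu_time[cpu] = task.cpu_time.get(cpu, 0) + delta_ns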
Explicit migrations (no sched_migrate event)
While there is conveniently an event for when the kernel decides to migrate a task from one CPU to another, there is no event for when the task requests a migration on its own. These are effected by system calls (sched_setaffinity), so the sys_exit event handler must compare the event CPU to the task's CPU, and if they differ, presume a migration has occurred. (This is described above, but repeated here in the "issues" section for completeness.)
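A minimal sketch of that comparison, again with hypothetical names mirroring the logic just described:
# Hypothetical sketch: at sys_exit, a CPU change with no intervening
# sched_migrate_task event implies an explicit migration.
def on_sys_exit(task, event_cpu):
    if task.cpu is not None and event_cpu != task.cpu:
        task.migrations += 1   # presume, e.g., sched_setaffinity moved it
        task.cpu = event_cpu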
Mapping system call IDs to names is architecture-specific
System calls are identified in events only as unique numeric identifiers. These identifiers are not readily interpreted by humans in the report. They are not readily mapped to their mnemonics because they are architecture-specific, and new system calls can be added in newer kernels. Fortunately, perf provides a means to map system call numeric identifiers to system call names. A simple example follows:
from Util import syscall_name

def raw_syscalls__sys_enter(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, id, args, perf_sample_dict):

    print "%s id=%d" % (syscall_name(id), id)
Unfortunately, using syscall_name introduces a dependency on the audit python bindings. This dependency is being removed in upstream versions of perf.
Mapping hypervisor call IDs to names is non-existent
Similar to system calls, hypervisor calls are also identified only with numeric identifiers. For IBM's POWER hypervisor, they are statically defined. Unfortunately, perf does not provide a means to map hypervisor call identifiers to mnemonics. curt includes a (hardcoded) function to do just that:
hcall_to_name = {
    '0x4':'H_REMOVE',
    '0x8':'H_ENTER',
    '0xc':'H_READ',
    '0x10':'H_CLEAR_MOD',
    […]
}

def hcall_name(opcode):
    try:
        return hcall_to_name[hex(opcode)]
    except:
        return str(opcode)
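A quick usage check, assuming the (abbreviated) table above: known opcodes map to mnemonics, and unknown opcodes fall back to the decimal string of the number:
print(hcall_name(0x4))      # prints 'H_REMOVE'
print(hcall_name(0x9999))   # not in the table, so prints '39321' (str(0x9999))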
Command strings as bytearrays
perf stores command names and string arguments in Python bytearrays. Unfortunately, printing bytearrays in Python prints every character in the bytearray, even if the string is null-terminated. For example:
$ perf record -a -e 'sched:sched_switch' sleep 3
$ perf script -g Python
generated Python script: perf-script.py
$ perf script -s ./perf-script.py
in trace_begin
sched__sched_switch 3 664597.912692243 21223 perf prev_comm=perf^@-terminal-^@, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3^@^@^@^@^@, next_pid=23, next_prio=0
[…]
One solution is to truncate the length of these bytearrays based on null termination, as needed before printing:
def null(ba):
    null = ba.find('\x00')
    if null >= 0:
        ba = ba[0:null]
    return ba

def sched__sched_switch(event_name, context, common_cpu,
        common_secs, common_nsecs, common_pid, common_comm,
        common_callchain, prev_comm, prev_pid, prev_prio, prev_state,
        next_comm, next_pid, next_prio, perf_sample_dict):

    print "prev_comm=%s, prev_pid=%d, prev_prio=%d, " \
        "prev_state=%s, next_comm=%s, next_pid=%d, " \
        "next_prio=%d" % \
        (null(prev_comm), prev_pid, prev_prio,
        flag_str("sched__sched_switch", "prev_state", prev_state),
        null(next_comm), next_pid, next_prio)
Which nicely cleans up the output:
sched__sched_switch 3 664597.912692243 21223 perf prev_comm=perf, prev_pid=21223, prev_prio=120, prev_state=, next_comm=migration/3, next_pid=23, next_prio=0
Dynamic mappings, like IRQ number to name
Dissimilar to system calls and hypervisor calls, interrupt numbers (IRQs) are dynamically assigned by the kernel on demand, so there can't be a static table mapping an IRQ number to a name. Fortunately, perf passes the name to the event's irq_handler_entry routine. This allows a script to create a dictionary that maps the IRQ number to a name:
irq_to_name = {}

def irq__irq_handler_entry_new(event_name, context, common_cpu, common_secs, common_nsecs, common_pid, common_comm, common_callchain, irq, name, perf_sample_dict):
    irq_to_name[irq] = name
    event = Event_irq_handler_entry(nsecs(common_secs, common_nsecs), common_cpu, common_pid, common_comm, irq, name, getpid(perf_sample_dict))
    process_event(event)
Somewhat oddly, perf does not pass the name to the irq_handler_exit routine. So, it is possible that a trace may only see an irq_handler_exit for an IRQ, and the script must be able to tolerate that. Here, instead of mapping the IRQ to a name, the IRQ number is returned as a string:
def irq_name(irq):
    if irq in irq_to_name:
        return irq_to_name[irq]
    return str(irq)
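For example, once an irq_handler_entry event for IRQ 23 has been processed, irq_name(23) returns whatever name the kernel reported (the 'eth0' here is purely illustrative); before that, it returns the number as a string:
print(irq_name(23))        # '23' (IRQ not yet seen)
irq_to_name[23] = 'eth0'   # as if an irq_handler_entry event had arrived
print(irq_name(23))        # 'eth0'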
Task 0
Task 0 shows up everywhere. It's not a real task. It's a stand-in for the "idle" state. It's the task ID given to the sched_switch event handler when the CPU is going to (or coming from) the "idle" state. It's often the task that is "interrupted" by interrupts. Tracking the statistics for task 0 as if it were a real task would not make sense. Currently, curt ignores task 0. However, this loses some information, like time spent in interrupt processing. curt should, but currently doesn't, track interesting (non-idle) time for task 0.
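The current behavior amounts to an early return for the idle task. As a minimal sketch (the exact placement within curt's event processing is an assumption on my part):
# Sketch: drop events attributed to the idle "task" before accounting.
if common_pid == 0:
    return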
Spurious sched_migrate_task events (same CPU)
Rarely, a sched_migrate_task event occurs in which the source and target CPUs are the same. In other words, the task is not migrated. To avoid artificially inflated migration counts, this case must be explicitly ignored:
class Event_sched_migrate_task (Event):
    def process(self):
        […]
        # same source and destination CPU: no migration actually occurred
        if self.cpu == self.dest_cpu:
            return
exec
The semantics of the exec system call are that the image of the current process is replaced by a completely new process image, without changing the process ID. This is awkward for tracking the statistics of a process (really, a task) based on the process (task) ID. The change is significant enough that the statistics for each task should be accumulated separately, so the current task's statistics need to be closed out and a new set of statistics should be initialized. The challenge is that both the old and new tasks have the same process (task) ID. curt addresses this by tagging the task's task ID with a numeric suffix:
class Event_sched_process_exec (Event):
    def process(self):
        global start_timestamp, curr_timestamp
        curr_timestamp = self.timestamp
        if (start_timestamp == 0):
            start_timestamp = curr_timestamp

        task = super(Event_sched_process_exec, self).process()

        new_task = Task(self.timestamp, self.command, task.mode, self.pid)
        new_task.sched_stat = True
        new_task.syscall = task.syscall
        new_task.syscalls[task.syscall] = Call()
        new_task.syscalls[task.syscall].timestamp = self.timestamp

        task.change_mode(curr_timestamp, 'exit')

        # find an unused suffixed task ID for the old task's statistics
        suffix = 0
        while True:
            old_tid = str(self.tid) + "-" + str(suffix)
            if old_tid in tasks:
                suffix += 1
            else:
                break

        # file the old statistics under "tid-suffix" and start fresh
        # statistics for the new process image under the original tid
        tasks[old_tid] = tasks[self.tid]
        del tasks[self.tid]
        tasks[self.tid] = new_task
This will clearly separate the statistics for the different process images. In the example below, the perf command (task "9614-0") exec'd execs.sh (task "9614-1"), which in turn exec'd itself (task "9614"):
--  [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [  9614] execs.sh    4  1.328238  0.485604  0.000000  0.000000  0.000000  2.273230 |  44.4%
    [  9614] execs.sh    7  0.000000  0.201266  0.000000  0.000000  0.000000  0.003466 |  98.3%
    [  9614] execs.sh  ALL  1.328238  0.686870  0.000000  0.000000  0.000000  2.276696 |  47.0%     1

--  [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-0] perf        3  0.000000  0.408588  0.000000  0.000000  0.000000  2.298722 |  15.1%
    [9614-0] perf        4  0.059079  0.028269  0.000000  0.000000  0.000000  0.611355 |  12.5%
    [9614-0] perf        5  0.000000  0.067626  0.000000  0.000000  0.000000  0.004702 |  93.5%
    [9614-0] perf      ALL  0.059079  0.504483  0.000000  0.000000  0.000000  2.914779 |  16.2%     2

--  [  task] command   cpu      user       sys       irq        hv      busy      idle |  util% moves
    [9614-1] execs.sh    3  1.207972  0.987433  0.000000  0.000000  0.000000  2.435908 |  47.4%
    [9614-1] execs.sh    4  0.000000  0.341152  0.000000  0.000000  0.000000  0.004147 |  98.8%
    [9614-1] execs.sh  ALL  1.207972  1.328585  0.000000  0.000000  0.000000  2.440055 |  51.0%     1
Distribution support
Surprisingly, there is currently no support for perf's Python bindings in Ubuntu. Follow the saga for more detail.
Limit on number of traced events
As curt becomes more sophisticated, it is likely that more and more events will need to be included in the trace file. perf currently requires one file descriptor per event per CPU. This becomes a problem when the maximum number of open file descriptors is not a large multiple of the number of CPUs on the system. On systems with large numbers of CPUs, this quickly becomes a problem. For example, the default maximum number of open file descriptors is often 1,024. An IBM POWER8 system with four sockets may have 12 cores per socket and eight threads (CPUs) per core. Such a system has 4 * 12 * 8 = 384 CPUs. In that case, perf could trace only about two events! A workaround is to (significantly) increase the maximum number of open file descriptors (ulimit -n if the system administrator has configured the hard limits high enough; or the administrator can set the limits higher in /etc/security/limits.conf for nofile).
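Since perf needs one file descriptor per event per CPU, the arithmetic is easy to check. The following sketch (illustrative only, not part of curt) estimates how many events could be enabled under the current soft limit:
import resource
import multiprocessing

# events ~= (open file descriptor limit) / (number of CPUs)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
ncpus = multiprocessing.cpu_count()
print("soft limit %d, %d CPUs: about %d traceable events" %
      (soft, ncpus, soft // ncpus))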
Summary
I hope this article shows the power of perf, and especially the utility and flexibility of the Python scripting it enables, for performing sophisticated processing of kernel trace data. Also, it shows some of the issues and edge cases that can be encountered when the limits of such technologies are tested.
Please feel free to download and make use of the curt tool described here, report problems, suggest improvements, or contribute code of your own on the curt GitHub page.