Hacking Ubuntu Touch, Part 7: System and process monitoring tools (part 1)
NOTE: This is a continuation of the series and relies on having Developer mode enabled. Thanks to Oliver Grawert, Stéphane Graber, Ricardo Salveti and Selene Scriven from Canonical for their input.
In the last article I introduced the different logging facilities present on current Ubuntu Touch devices and showed how you can query the stored messages. Log files are an easy way to debug simple problems, but their usefulness can be quite limited:
- The process you are looking at has to actually use a logging facility. Not every one does; you might have noticed all the missing fields in the table at the end of my previous article.
- The process has to actually log the one thing you’re interested in.
- You need to have an idea about which component is at fault, or at least what the logged message could look like. Even if you run grep over all log files, you have to know what you’re looking for.
In many cases you have a clear idea about what’s going wrong and there are enough log messages. Let’s say your phone refuses to connect to the mobile network despite the baseband being powered on and a SIM card being present. Not many components are involved in mobile communications, so the search quickly narrows down to the likes of ofono and rild, and usually the logging facilities and the ofono debugging scripts give you enough information to debug the problem.
But there are a lot of other cases when you don’t immediately know who is at fault, e.g. when the UI locks up or starts lagging or the battery runs out in a very short amount of time. Or you may not even know if something is happening at all, like when the phone “seems” to consume more power than usual. You might be able to find out which component is involved, but if it doesn’t generate any suspicious log messages by itself, what do you do?
This article might contain some answers to these questions. I deliberately limited myself to things that can be done on an unmodified phone: you may not want to, or be able to, modify your phone image every time you are looking for something, you might not have a network connection at hand to download additional software, or maybe you are assisting a friend over the phone and talking them into enabling Developer mode was already hard enough.
ps
ps
prints a list of all running processes, optionally augmented with information from in-kernel process descriptors and various kernel sub-systems, filtered through a list of configurable filters and modified by output options.
My standard set of options for a quick overview is -yfMeHl. It selects all processes on the whole system, prints them in a “process tree”, switches the output mode to “long” format, shows full field information and adds fields about memory consumption and security.
Let’s look at an artificial (!) and manually filtered (!) example on my bq Aquaris E4.5 Ubuntu Edition:
Because of the “process tree” format you can immediately see that there are several hierarchies: the in-kernel threads are started by [kthreadd], the Android container runs below lxc-start -n android -- /init, all user-space system processes are started by /sbin/init and all user-space user processes by a user init session started as init --user. You can see that apps are not started by unity8-dash; instead, unity8-dash tells init --user to start them, so there’s no process hierarchy under unity8-dash. We also now know that apps are normal processes and not something special, and we can see that scopes do not seem to be processes.
Now for the important columns in the output.
The first column shows the security label (LABEL) associated with the process. There are two possibilities: either the special value unconfined, which means that there are no restrictions imposed on the process, or an actual string. If there is a string set, you can find a directory of the same name in /sys/kernel/security/apparmor/policy/profiles/ and the system has attached restrictions to this process. This is what Ubuntu Touch developers refer to as running something (like every app) in confined mode. We will have a closer look at confinement in a future article.
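If you only care about the confined processes, you can filter on the label column. The following is a minimal sketch, not a polished tool: the helper name confined_only is made up for this example, and it assumes that the label is the first whitespace-separated field of each line, as it is when you ask ps for the label column first.

```shell
# Hypothetical helper: keep only lines whose first field (the security
# label) is something other than "unconfined". Reads ps-style output on
# stdin and also drops a LABEL header line if one is present.
confined_only() {
    awk '$1 != "unconfined" && $1 != "LABEL"'
}

# Possible usage on a device (label first, headers suppressed):
#   ps -e -o label=,pid=,comm= | confined_only
```

Everything that survives the filter should have a matching profile directory below /sys/kernel/security/apparmor/policy/profiles/.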
The second column shows the process state (S
). There are seven possible values, of which only five can be actually seen:
- D for uninterruptible sleep. This is usually seen when the process is waiting for an I/O transaction.
- R for running/runnable. The process is either currently being executed on one of the CPUs, or it is not blocked by anything and is ready to be executed.
- S for interruptible sleep. The process is waiting for some event, e.g. a timer going off.
- T for stopped. There are two reasons for this: someone, e.g. the user or a job control system, stopped the process, or it is currently being traced (e.g. by a debugger).
- W for paging. This should never be seen on a phone because this state was deprecated in the kernel 2.6 series.
- X for dead. This should also never be seen.
- Z for zombie. The process terminated, but its parent didn’t go and reap it, so it still lives on.
Now for the above example: it looks pretty normal at first sight. Most processes are sleeping and ps -yfMeHl is running (because I ran it in phablet-shell to get this output). But wait, what’s up with all the qmlscene processes at the bottom? Why have they been stopped? If you run ps -yfMeHl on your desktop, you’ll notice that state T hardly ever shows up. On the phone this state can be seen all the time because of the App Lifecycle: Unity8 just freezes every app which is not actively running in the foreground at the moment.
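To get a quick feeling for how many processes sit in which state, you can count the state letters. This is only a sketch under a few assumptions: count_states is a name I made up, and it expects one state letter per line on stdin, which is what ps prints when you ask for the state column alone.

```shell
# Hypothetical helper: count how often each process state letter occurs.
# Expects one state letter per line on stdin; prints "<state> <count>",
# sorted by state letter.
count_states() {
    awk '{ n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Possible usage on a device:
#   ps -e -o state= | count_states
```

On the phone, a noticeable number of processes in state T is what you would expect whenever a few apps are open in the background.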
The sixth column is the CPU utilization over the lifetime of the process (C). This will hardly ever be anything other than “0” on the phone, because the phone sleeps all the time and no process manages to accumulate a significant amount of CPU time.
The seventh and eighth columns are priority (PRI) and niceness (NI), respectively. These are used by the kernel scheduler to decide which processes run in which order and which process takes precedence if two or more “compete” for the same place. The actual meaning of both values is a kernel implementation detail and has changed a couple of times over the years, but a higher priority value still means a lower priority, and the lower the niceness, the more favorable for the process. But why are there two values for the same thing anyway? The answer is that the current priority is calculated internally by the kernel based on several factors. The niceness value, on the other hand, is a constant that can be set from userspace and modifies the value calculated by the kernel. Without the niceness value, the user would have no way to tell the system that they know better or think differently.
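Setting the niceness from userspace can be tried out with the nice and renice commands. The wrapper below is only an illustration: the name run_nicely and the increment of 10 are arbitrary choices of mine, not something the system prescribes.

```shell
# Hypothetical wrapper: start a command with its niceness raised by 10,
# i.e. in a less favorable position than the shell that calls it.
run_nicely() {
    nice -n 10 "$@"
}

# "nice" without arguments prints the niceness it is running with, so this
# shows the adjusted value:
#   run_nicely nice
# An already running process can be adjusted with renice, e.g.:
#   renice -n 10 -p <PID>
```

Keep in mind that a normal user can usually only make a process nicer; lowering the niceness typically requires root.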
The ninth (RSS) and tenth (SZ) columns show “Resident Set Size” and “Size”, respectively. RSS is displayed in kilobytes and refers to the amount of actual, physical memory that’s currently used by the process, so it doesn’t include any pages currently swapped out. This is a major “problem” on the phone. The Aquaris E4.5 has 1 GB of RAM, but 512 MB are set up as compressed swap:
So it is actually normal that huge portions of many processes, especially apps that haven’t been used for some time, are swapped out! That’s why SZ would be more interesting, because it displays the total virtual size (all of code, data, and stack) of the process, if there weren’t an additional catch: SZ is in system pages, not in kilobytes. Argh. Now what’s the page size on ARM?
So we have to multiply SZ by 4096 on this device, and we also have to keep in mind that SZ counts every shared library at its full size. The sum of all virtual process sizes on my device seems to be about 10 gigabytes, so this is also less helpful. Well, nobody said memory management is easy…
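The conversion from pages to something readable is simple arithmetic, sketched below. The helper name pages_to_kib is made up; if you don’t pass a page size, it asks the system via getconf PAGESIZE.

```shell
# Hypothetical helper: convert a size given in pages to kibibytes.
# $1 = number of pages (e.g. a value from the SZ column)
# $2 = page size in bytes (optional, defaults to "getconf PAGESIZE")
pages_to_kib() {
    pagesize="${2:-$(getconf PAGESIZE)}"
    echo $(( $1 * pagesize / 1024 ))
}

# Example: 25000 pages of 4096 bytes each
#   pages_to_kib 25000 4096
```

With 4096-byte pages this boils down to multiplying SZ by four to get kilobytes.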
Column twelve displays the process start timestamp.
Column thirteen shows the TTY associated with the process. Only interactive processes have one.
Column fourteen shows the accumulated CPU time in hours, minutes and seconds.
The last column displays the full command, including parameters. If the name is shown between brackets, it is a kernel thread.
You can sort the list by using the --sort parameter, like --sort=%cpu or --sort=%mem, but keep in mind that CPU utilisation, for example, is cumulative over the lifetime of the process, so you can’t find out which process is the biggest CPU hog at this exact moment.
It is possible to switch from process to thread mode by adding the -L parameter. The output is mostly the same, but you might get additional columns like LWP (thread ID within a process) and NLWP (number of threads in the process).
top
top is the go-to tool if you want a quick, interactive overview of which processes and/or threads on your system use the most of a given resource, usually CPU or memory.
Let’s look at a typical output:
By default the screen is sorted by CPU utilization and updated every three seconds (this is changeable with the -d parameter). It will run until stopped by pressing q or Ctrl+C. top is interactive, so you can switch between multiple pages.
The main page header shows general system statistics in the following order:
First row: current system time, uptime in days, hours and minutes, number of active users, load average over the last one, five and fifteen minutes.
Second row: total number of processes, number of running, sleeping, stopped and zombie processes.
Third row: percentage of CPU resources spent on user processes (us), kernel processes (sy), niced processes (ni), being idle (id), waiting for I/O completion (wa), handling hardware (hi) or software (si) interrupts, and stolen (st) by the hypervisor. By default the values are aggregated over all CPUs.
Fourth and fifth row: total available, used and free memory/swap, memory in buffers or caches.
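The CPU fields in the third header row come from counters that the kernel exposes in the aggregated cpu line of /proc/stat. The sketch below turns that line into the same kind of percentages. It is only an approximation of what top does: cpu_breakdown is a made-up name, and the result is an average since boot, while top shows the change between two refreshes.

```shell
# Hypothetical helper: read /proc/stat-style input on stdin and print the
# share of each CPU time category, using the field names from top.
# Field order in the "cpu" line: user nice system idle iowait irq softirq steal
cpu_breakdown() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i
        if (total == 0) exit
        printf "us=%.1f ni=%.1f sy=%.1f id=%.1f wa=%.1f hi=%.1f si=%.1f st=%.1f\n",
            100 * $2 / total, 100 * $3 / total, 100 * $4 / total,
            100 * $5 / total, 100 * $6 / total, 100 * $7 / total,
            100 * $8 / total, 100 * $9 / total
    }'
}

# Possible usage on a device:
#   cpu_breakdown < /proc/stat
```

On a phone that sleeps most of the time you would expect the idle share to dominate.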
This already gives us a good indication of what’s going on: the load average is negligible, so the system isn’t overloaded with anything. The vast majority of processes are sleeping or stopped; only top is actually running, and only two processes are consuming a bit of CPU: media-hub-server and pulseaudio, because I was listening to some music.
The memory usage looks a bit odd though: the device has 1 GB of physical memory, but 512 MB are set up as compressed swap (as mentioned before), and it looks like most of the memory is full. This is “normal” on the phone: the App Lifecycle will always keep as many apps in memory as possible. If the kernel runs out of memory, it will select the best “victim” (e.g. the app that hasn’t been used for the longest time) and kill it. The system still shows the killed app in the app switcher and restarts it if selected.
The process list on the main page is sorted by top CPU usage and shows the following columns:
PID, USER, PRI and NI are identical to the columns of the same name shown by ps.
VIRT, RES and SHR display the virtual, resident and shared memory size of the process in kilobytes. VIRT is the total size of all sections, regardless of whether they are in memory or swapped out. RES is identical to the RSS column displayed by ps. SHR is the amount of shared memory available to the process; this is a “potential” value, so it doesn’t say that the memory is actually shared.
S is identical to the S column shown by ps.
%CPU and %MEM display the percentage of CPU and physical RAM used by a process, but in contrast to ps the values are updated between iterations.
There are several possibilities to change the content of the process table:
- You can change which fields are displayed and in which order by pressing the f key. It leads to an interactive screen.
- The c key extends the COMMAND column to full length. This is usually necessary because the field is cut off.
- The o key prompts for a filter criterion. See the man page for an introduction on how to write a filter rule.
- The u key prompts for a username or user ID; if one is entered, only processes owned by this user will be shown.
- The V key activates a “process tree” display like the one shown with the ps command before. Obviously you will need a wide terminal for this.
- By default the sort field is %CPU. You can switch between the currently displayed columns with the < (left) and > (right) keys.
The man page contains many more key bindings and options.
vmstat
vmstat initially only displayed information about the kernel virtual memory subsystem and was later extended to display additional system information. It can be run in two modes: when you just call vmstat without an interval, it outputs values since the last reboot. When you specify an interval, like vmstat 1, it first outputs the values since the last reboot, but afterwards continues to print a delta every n seconds.
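Because the first sample covers the whole time since the last reboot, it is often just noise when you want to watch what is happening right now. A small filter can drop it. This is a sketch under the assumption that the output starts with two header lines followed by the since-boot sample; skip_boot_sample is a made-up name.

```shell
# Hypothetical helper: drop the third line of vmstat output, which holds
# the values since the last reboot. The two header lines and all later
# samples pass through unchanged.
skip_boot_sample() {
    awk 'NR != 3'
}

# Possible usage on a device: one sample per second, six samples in total,
# of which the five "live" ones are shown.
#   vmstat 1 6 | skip_boot_sample
```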
The default output is optimized for terminals 80 columns wide and shows 17 different values:
r and b are the numbers of processes in the “running” and “uninterruptible sleep” states.
swpd is the amount of used swap, free the amount of free memory, buff the amount of memory used for I/O buffers and cache the amount of memory used as read cache.
si and so are the amounts of memory swapped in from and out to disk, per second.
bi and bo are the numbers of blocks read from and written to block devices, per second.
in and cs are the numbers of interrupts and context switches, per second. The interrupt counter includes the clock and peripherals, so this number won’t go down even if the phone goes to sleep.
us, sy, id, wa and st are identical to the columns of the same name displayed by top.
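With the column order in mind you can pick out interesting samples automatically. The sketch below prints only those samples in which something was swapped in or out. It assumes the default 17-column layout described above, where si and so are the seventh and eighth fields; swap_activity is a made-up name.

```shell
# Hypothetical helper: print only vmstat samples with swap traffic.
# Header lines are skipped because their first field is not a number;
# in the default layout si is field 7 and so is field 8.
swap_activity() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0)'
}

# Possible usage on a device:
#   vmstat 1 30 | swap_activity
```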
If you know better and/or something has changed, please do find me on the Freenode IRC or on Launchpad.net and get in contact!