[Toaster] [PATCH 0/5] Fix task buildstats gathering

Mon Feb 22 02:25:11 PST 2016

On 21/02/2016 12:04, "Richard Purdie" <richard.purdie at linuxfoundation.org>
wrote:

>On Sat, 2016-02-20 at 18:51 +0000, Barros Pena, Belen wrote:
>> So I ran a build following the steps above, and had a look at the data.
>> Information now shows, which is definitely an improvement :)
>> 
>> * Time and Disk I/O show for all executed tasks, which is what's
>>   supposed to happen.
>> * CPU usage is missing from some executed tasks. This used to happen
>>   before as well, but we never actually worked out why.
>> 
>> Regarding the numbers, I am not sure how useful the Disk I/O figure is
>> when expressed in bytes. Should we convert it to something else? And
>> then, our big problem is definitely the CPU usage, which shows pretty
>> crazy numbers. The highest one in my build was for linux-yocto
>> do_compile_kernelmodules at a whopping 2455.15%
>> 
>> So, Richard Purdie pointed out to us that % over 100 are related to
>> task parallelism. And in fact, if I divide the CPU usage value of the
>> compile tasks by the value of PARALLEL_MAKE (36), I do get percentages
>> below 100 for all of them. In the example of linux-yocto
>> do_compile_kernelmodules, we get 68.20%
>> 
>> If I divide the CPU usage value of all the install tasks by the value
>> of PARALLEL_MAKEINST (36), the same happens: % below 100.
>> 
>> However, we do get % over 100 for tasks that we have been told have no
>> parallelism at all. I see such values for unpack, patch, configure,
>> package and populate_sysroot tasks.
>
>FWIW, do_package does contain parallelism.
>For unpack/patch/configure/populate_sysroot, there is some parallelism
>too, in that the parent logging runs in parallel with the child
>execution. Where these tasks run quickly, I think this could account
>for the 'parallelism' we see there. Was it only for short running
>tasks?

I am not sure what you'd consider a 'short running' task, so here are a
few examples:

Recipe                  Task                 Time (secs)  CPU usage
db                      do_unpack                   1.29    144.34%
linux-libc-headers      do_unpack                   9.81    134.64%
linux-yocto             do_patch                   16.40    141.12%
rpm-native              do_patch                    4.84    117.13%
gcc-cross-i586          do_configure                3.25    125.24%
gcc-cross-initial-i586  do_configure                3.27    120.50%
glibc-locale            do_populate_sysroot         2.80    192.74%
flex                    do_populate_sysroot         1.98    157.74%
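
To make the figures in the table above easier to compare, here is a rough Python sketch that turns each percentage back into CPU seconds. It assumes the CPU usage figure is CPU time divided by wall-clock time, times 100; I have not confirmed that this is exactly how buildstats derives it, and the helper name cpu_seconds is made up for this illustration.

```python
# (recipe, task, wall-clock seconds, CPU usage %) copied from the table above.
rows = [
    ("db", "do_unpack", 1.29, 144.34),
    ("linux-libc-headers", "do_unpack", 9.81, 134.64),
    ("linux-yocto", "do_patch", 16.40, 141.12),
    ("rpm-native", "do_patch", 4.84, 117.13),
    ("gcc-cross-i586", "do_configure", 3.25, 125.24),
    ("gcc-cross-initial-i586", "do_configure", 3.27, 120.50),
    ("glibc-locale", "do_populate_sysroot", 2.80, 192.74),
    ("flex", "do_populate_sysroot", 1.98, 157.74),
]


def cpu_seconds(wall_secs, cpu_percent):
    """Assumption: cpu_percent = 100 * CPU seconds / wall-clock seconds."""
    return wall_secs * cpu_percent / 100.0


for recipe, task, secs, pct in rows:
    total = cpu_seconds(secs, pct)
    # "extra" is the CPU time beyond what a single fully busy core would give.
    extra = total - secs
    print("%-24s%-21s%7.2f CPU-s (%5.2f s beyond one core)"
          % (recipe, task, total, extra))
```

Under that assumption, the "extra" column is the part that the parent logging running alongside the child would have to account for.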

>
>> So, the question is, why are those
>> happening? Because for tasks that we know have parallelism we might
>> be able to divide the value by the parallelism set, as I did for
>> compile and install tasks. But for the others, I genuinely have no
>> answer, other than that there must be some kind of bug in the data
>> collection.
>
>FWIW I'm very strongly against doing any kind of division by
>PARALLEL_MAKE or similar numbers as it makes the end resulting number
>much less meaningful. The idea behind showing these numbers in the UI
>is to allow people to make decisions based on the numbers. If you do
>that division, I can't think of useful decisions/actions I could then
>make on it.
>
>For example, "2455%" above tells me that we have a parallelism factor
>of about 24 in the kernel build. From that I can conclude that whilst
>the kernel is good, it's not making full use of the hardware which would
>have been a factor of 36 (although the system was likely also busy
>doing other things unless the task was run in isolation).

And what if we look at the other side of the data? What does a value of
7.32% mean, like the one we get for netbase do_compile?

>
>The only proposal I have is to simply display these as parallelism
>factors rather than percentage usages.

So, in the example of linux-yocto do_compile_kernelmodules, would we show
'24'? And for netbase do_compile? '0.07'?
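
As a quick illustration of the two readings of the same figure, here is a small Python sketch. The function names are hypothetical and not part of Toaster or buildstats; the only inputs are numbers already quoted in this thread.

```python
# Hypothetical helpers for illustration only.

def parallelism_factor(cpu_percent):
    """Express a CPU usage percentage as a number of fully busy cores."""
    return cpu_percent / 100.0


def divided_by_parallel_make(cpu_percent, parallel_make):
    """The division tried earlier in this thread: usage per configured job."""
    return cpu_percent / parallel_make


# linux-yocto do_compile_kernelmodules
print(round(parallelism_factor(2455.15), 2))
# netbase do_compile
print(round(parallelism_factor(7.32), 2))
# linux-yocto do_compile_kernelmodules with PARALLEL_MAKE = 36
print(round(divided_by_parallel_make(2455.15, 36), 2))
```

The first two lines give roughly 24.55 and 0.07, which are the values a "parallelism factor" column would show for those two tasks.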

Would it just be easier to show resource usage times, since they are
something of a standard measure that users might be more likely to recognise?
So we could show either:

* Child rusage ru_utime + Child rusage ru_stime
* or we split CPU usage into 2 columns, one for ru_utime and the other for
ru_stime
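
For reference, the two options above could look something like this minimal Python sketch on a Unix system. It uses the standard resource module to read the child rusage values; the helper names and the sample command are made up for illustration, and the percentage formula is my assumption about how the current figure relates to these times, not a description of the actual buildstats code.

```python
import resource
import subprocess
import sys
import time


def child_cpu_times(argv):
    """Run argv and return (ru_utime, ru_stime, elapsed) in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(argv, check=True)  # waits, so the child is accounted for
    elapsed = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime,
            elapsed)


def cpu_percent(utime, stime, elapsed):
    """Assumed relationship: CPU seconds over wall-clock seconds, times 100."""
    return 100.0 * (utime + stime) / elapsed


# Sample child process, chosen only so there is something to measure.
utime, stime, elapsed = child_cpu_times(
    [sys.executable, "-c", "sum(range(10**6))"])

# Option 1: a single column with ru_utime + ru_stime.
print("CPU time: %.2fs" % (utime + stime))
# Option 2: two columns, one per value.
print("user: %.2fs  system: %.2fs" % (utime, stime))
# The percentage form, for comparison.
print("CPU usage: %.2f%%" % cpu_percent(utime, stime, elapsed))
```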

>I don't think the data collected
>is wrong, it does require a certain amount of thinking to interpret it
>though.
>We're pulling this data direct from the kernel so it's unlikely
>we can get any "better" (easier to interpret?) data either.
>
>Cheers,
>
>Richard
>
>