dcsimg

strace: The Friend You Never Knew You Had

While strace is often used for troubleshooting and debugging, you can also use strace to get started on examining the I/O pattern of your serial codes.

One of the handiest tools used by admins to track down problems is strace, a debugging tool that traces system calls and shows where programs encounter problems. While strace is often used for troubleshooting and debugging, you can also use strace to get started on examining the I/O pattern of your serial codes.

With High Performance Computing (HPC) applications, it’s always better to have more information about the behavior of the code than too little. Knowing the behavior of your code means knowing how it scales, if it’s memory bandwidth intensive, if it’s floating point intensive, how integer intensive is it, how much message passing is done, what kind of messages are passed, what size messages are passed, and on and on. With this information you can then start to configure your cluster to match your code.

Configuring or designing a cluster to match your code is a very different approach than in the past. With previous high performance systems, you had to modify your code to match the machine. This meant that every time a new series came out, you had to modify and tune your code, which is a long and laborious process. But now, clusters give you so many options that you can tune the hardware and the software to match your code. However, to do this, you need to know your code (s) very well.

One aspect of knowing your code that many people either miss, skip, or forget, is the I/O pattern. Many times the reason they don’t know is that they are not quite sure how to determine the I/O pattern and develop I/O requirements from them. In some cases, they don’t even know how different I/O rates affect the run time of their code. In this and the next several columns, I hope to give you a basic tool and some tips to examine your I/O patterns and develop basic metrics from them.

I’m sure there are some people who already know how to use strace to examine the I/O of a code, but for those who don’t know how, let’s walk through a simple example based on a simple serial code.

Simple Serial Example

The code below is a simple C program that just creates an array of floats and writes it to a file. Sure it’s a really simple code, but I want to show you how to use strace to look at the I/O pattern of the code.

#include 
#define BUFSIZE 100000
int main(int argc, int *argv[] ){
   int i;
   float buf[BUFSIZE];
   char filename[128];
   FILE *myfile;
   for (i=0; i < BUFSIZE; i++)
      buf[i] = 2.5 + BUFSIZE * i;
   sprintf(filename, "testfile");
   myfile = fopen(filename, "w");
   fwrite(buf, sizeof(float), BUFSIZE, myfile);
   fclose(myfile);
}

I compiled the code using gcc on my home box which has a simple software RAID-1 device with a couple of 40GB Seagate drives and uses Ext3. We will use strace to run the resulting binary. In general, strace is used to capture the function calls from the binary. This includes calls to I/O functions such as open, llseek, write, read, and close. The command line I will use is,

strace -tt -o ./serial.strace ./serial

Notice that I named the binary serial.

I used two options with strace. The first option, -tt gives microsecond timings for all function calls. The second option, -o serial.strace, sends the strace output to a file called serial.strace. The strace output is too long for this column, but I will show you a few lines from the output and how you can derive some metrics from it.

Analyzing the strace Output

The I/O pattern for the code is really simple. It just opens a file, writes the floating data to the file and then closes the data. It's pretty simple, but in this case, there are actually two writes. The first one writes about 397KB of data and the second writes only about 2.7KB. So we have a reasonably large write followed by a very small write. More complicated codes will have different I/O patterns (more on that in future columns). Let's take a closer look at the I/O function calls.

We can ignore most of the stuff in the output since we're really interested in the I/O part of it. The relevant I/O portion (s) of the output are below:

14:47:03.559980 open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
14:47:03.560250 write(3, "\0\0 @@Q\303G\240PCHP|\222HPP\303HP$\364H(|\22I(\346*I"..., 397312) = 397312
14:47:03.562485 write(3, "\0\25\240N\r\30\240N\33\33\240N(\36\240N5!\240NB$\240N"..., 2688) = 2688
14:47:03.562613 close(3)

The first line that I've listed opens the file. The next two lines are important because they actually write out the data. You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages.

The function write writes data to a file descriptor (basically a file). You can see the first few bits of data that are being written inside the quotations in the strace output. For this example, the data is binary so the data displayed here doesn't make much sense. But if the data is text or non-binary data, then you might be able to read the first part of the data. strace does not list all of the data being written to the file. The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two writes. The first one writes 397,312 bytes and the second writes 2,688 bytes. The total of the fwrite functions is 400,000 bytes.

The fwrite function calls in the strace output file also contain more useful information. The time that the particular function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function.

In this example, it's the amount of time it took to perform the fwrite function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the first fwrite function is 2,235 microseconds. We get this by comparing the timestamp of the start of the first write (14:47:03.560250) and the timestamp (14:47:03.562485) on the start of the second write.

We also know that the amount of data is 397,312 bytes. So by doing the math (397,312 bytes+ 0.002235 seconds) we find that the resulting I/O rate is 177,768,233 bytes/sec. This is the same as 178 MB/s, assuming that a MB is 1,000,000 bytes.

The amount of time the second fwrite function took is 128 microseconds. The amount of data written is 2,688 bytes so the I/O rate is 2.1x10^7 bytes/sec. This is the same as 21 MBs, a bit slower than the previous write.

Now you can go through and compute the I/O rate for all read and write functions in your strace output! (Doesn't this cry out for automation?).

Observations

Let's compare the I/O rates for the two fwrite functions. The first fwrite writes out a reasonable amount of data- 397KB. But the second fwrite only writes about 2.7KB, a much smaller amount of data. Now compare the two I/O rates. The first I/O rate is about 178 MB/s and the second I/O rate is only about 21 MB/s. Why is there such a big difference?

A number of factors influence the I/O rate. The latency of the drive where it seeks to a certain location on the disk, the actual speed that the drive can write the data on the disk, how fast the file system can translate the data into blocks for the drive controller, any llseek function calls that move the data location to a different point in the file. And believe it not, the amount of data written has an impact on the I/O rate.

For very small amounts of data the amount of time to write the data is almost the same as the seek time or the latency of the drive. This is similar to sending messages across a network. That is, for very small packets, the time to send the data is about the same as the latency of the network.

Examining the time for the second fwrite, we can see that the time is very short and the amount of data is much much smaller than the first fwrite. We can observe that writing small amounts of data results in a very low I/O rate. Consequently, we can conclude that it is good to avoid small writes if possible. This is because small amounts of data have low I/O rates, which can slow down our code (s).

What Have We Learned?

I hope this column has introduced you to using strace to examine the I/O pattern of your code. For this simple example we found out that the resulting binary does a reasonably large write (397KB) followed by a very small write (2.7KB). The strace output can also be used to examine the I/O rates of your code. For this simple serial code we found that the first fwrite function writes data at about 178 MB/s (caching effects included) and the second, much smaller fwrite, has an I/O rate of 21 MB/s.

From these numbers it can be observed that it's a good idea to avoid small writes because the I/O rate is so low. So if your code does a great deal of small writes, and there are many applications that do this, your overall I/O throughput will be very low.

In the next column I will show you how to use strace with an MPI code. We'll also continue to look at more complicated I/O patterns. Until then take a few moments and use strace to look at the I/O of you favorite applications. Just be careful because strace can produce a huge amount of data.

Comments on "strace: The Friend You Never Knew You Had"

gserban

Cool. I actually had no idea this command existed.

parmando

Muy bueno, me di cuenta de varias cosas que pasaba por alto al ver la salida de strace!!

porridge

There is ‘ltrace’ as well. More useful sometimes.

ohmay

And for such “small” writes, how interacts the buffer-cache? In the article it is said that the difference between write rates are because the latency of the drive, the speed of the drive and this kind of things.

Buf for small writes the data writed shouldn’t be writed on buffer-cache?, and because of that at memory speed? Perhaps I’m wrong, but I think that the difference between write rates might be related to the initialization of I/O functions.

In any case, great article.

tombauer

I think an interesting question would be why does the compiler create code that does the write with two write system calls. Why not one? And what is special about 388K? Hmmm.

tothzp

1. A pretty big amount of data can be written to the buffer cache first. So when write() returns you don’t know if the data has been physically written to the disk or not. (Except if you flush your file to disk, e.g. with the fsync() or fdatasync() function.)

2. strace can be also used to collect statistics about the time spent in different system calls. Use the -c option for that:


# strace -p 16640 -c

Process 16640 attached - interrupt to quit
Process 16640 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
36.72 0.649557 36 17944 epoll_wait
33.40 0.590959 7 85036 gettimeofday
21.75 0.384777 16 23568 writev
6.65 0.117595 6 19767 recv
1.48 0.026198 8 3354 time
------ ----------- ----------- --------- --------- ----------------
100.0069 total

I let ppeloe tell me what to write about sometimes. You don’t have to write about everything, just find your own thing to be comfortable with and the blog will nearly write itself.

F*ckin’ amazing things here. I am very glad to see your post. Thanks a lot and i’m looking forward to contact you. Will you please drop me a e-mail?

What i do not understood is in fact how you are not really much more neatly-liked than you may be right now. You are very intelligent. You already know thus significantly when it comes to this subject, made me personally imagine it from so many various angles. Its like women and men are not involved unless it is something to accomplish with Lady gaga! Your own stuffs nice. At all times deal with it up!

Hello are using WordPress for your blog platform? I’m new to the blog world but I’m trying to get started and set up my own. Do you require any coding expertise to make your own blog? Any help would be really appreciated!

Hello there, I found your site via Google whilst looking for a similar matter, your site came up, it appears to be like great. I have bookmarked it in my google bookmarks.

You are my aspiration, I have few blogs and sometimes run out from brand :). “No opera plot can be sensible, for people do not sing when they are feeling sensible.” by W. H. Auden.

Every after in a even though we decide on blogs that we study. Listed beneath are the most up-to-date sites that we pick out.

“I think this is a real great article.Really thank you! Cool.”

ovHUO1 nstzbqtvsiqm, [url=http://okaxwxbofols.com/]okaxwxbofols[/url], [link=http://onnunjtenwrd.com/]onnunjtenwrd[/link], http://vizuaerhrevn.com/

You’ve made some decent points there. I checked on the internet
to learn more in regards to the issue and found a lot of people should go as well as your opinion of
this web site.

My homepage – CarlGCarlino

I just want to mention I am just newbie to blogging and site-building and absolutely savored you’re web blog. Almost certainly I’m planning to bookmark your blog post . You absolutely come with excellent writings. Cheers for sharing your web site.

Appreciate this post. Permit me to give it
a try.

my blog post MerriNGascot

I relish, lead to I discovered exactly what I was taking a look for.
You’ve ended my 4 day lengthy hunt! God Bless you man. Have a nice day.
Bye

my webpage; HongXHolsman

Leave a Reply