In the last couple of articles we have talked about using strace to help examine the IO profile of applications (including MPI applications; think HPC). But strace output can contain hundreds of thousands of lines. In this article we talk about using a tool called strace_analyzer to help sift through the strace output.
Recall that I used the “-short” flag; without it you will get considerably more output in this section. Using the “-short” flag is a good place to start, and you can try leaving it out at some point to get more information.
The first section of this output group presents the distribution of write() syscall sizes. For example, the first line lists the number of write() syscalls that wrote more than 0KB and less than 1KB of data; in this case it's 12,999 syscalls.
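To illustrate the idea, here is a minimal sketch of how write() sizes could be binned into these intervals. This is my own illustration, not the strace_analyzer code, and the exact handling of the interval boundaries (for example, where a write of exactly 1KB lands) is an assumption:

# Sketch only: bin write() sizes into the intervals the tool reports,
# using decimal units (1 KB = 1,000 bytes) per the article's convention.
from bisect import bisect_left

KB, MB, GB, TB = 10**3, 10**6, 10**9, 10**12
# Upper bound of each interval, in bytes
BOUNDS = [1*KB, 8*KB, 32*KB, 128*KB, 256*KB, 512*KB, 1000*KB,
          10*MB, 100*MB, 1*GB, 10*GB, 100*GB, 1*TB, 10*TB]
LABELS = ["0KB-1KB", "1KB-8KB", "8KB-32KB", "32KB-128KB", "128KB-256KB",
          "256KB-512KB", "512KB-1000KB", "1000KB-10MB", "10MB-100MB",
          "100MB-1GB", "1GB-10GB", "10GB-100GB", "100GB-1TB", "1TB-10TB"]

def size_histogram(sizes):
    # Count how many syscalls fall into each size interval.
    counts = [0] * len(BOUNDS)
    for s in sizes:
        counts[min(bisect_left(BOUNDS, s), len(BOUNDS) - 1)] += 1
    return dict(zip(LABELS, counts))

# Example: size_histogram([512, 2048, 2048, 16384])
#   -> {'0KB-1KB': 1, '1KB-8KB': 2, '8KB-32KB': 1, ...}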
The second section of the output group contains a statistical summary of the write system calls. It lists a great deal of information:
- Total amount of data written in bytes (sum of all write() system calls)
- Total number of write() syscalls
- Average (mean) of the number of bytes per write() syscall (in this case, the arithmetic mean)
- Standard Deviation of the number of bytes per write() syscall (in this case, the sample standard deviation computed about the mean)
- Mean Absolute Deviation of the number of bytes per write() syscall (the absolute deviation computed about the mean)
- The median number of bytes in a write() syscall
- The Median Absolute Deviation of the number of bytes in a write() syscall (the absolute deviation computed about the median)
- The time for the slowest write() syscall in seconds
- The location in the strace file of the slowest write() syscall (line number in the file)
- The smallest number of bytes in a write() syscall
- The largest number of bytes in a write() syscall
Notice that I’m using the decimal definition of a Megabyte (1 MB = 1,000,000 bytes) rather than the binary definition, which would be listed as MiB. An upcoming option in strace_analyzer will allow you to flip between the two, with decimal being the default.
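To make these definitions concrete, here is a minimal sketch of how the summary statistics could be computed from the list of write() syscall sizes. Again, this is my own illustration rather than the strace_analyzer source:

# Sketch of the write-summary statistics; "sizes" is the list of byte counts,
# one entry per write() syscall.
import statistics

def write_summary(sizes):
    n = len(sizes)
    total = sum(sizes)
    mean = total / n                                       # arithmetic mean
    stdev = statistics.stdev(sizes)                        # sample standard deviation
    mad_mean = sum(abs(s - mean) for s in sizes) / n       # absolute deviation about the mean
    median = statistics.median(sizes)
    mad_median = sum(abs(s - median) for s in sizes) / n   # absolute deviation about the median
    return {"total_bytes": total, "count": n, "mean": mean,
            "stdev": stdev, "mad_mean": mad_mean,
            "median": median, "mad_median": mad_median,
            "min": min(sizes), "max": max(sizes)}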
The write output group is created for every process of Abaqus/Explicit (16 processes). Table 3 below contains the write() syscall “size” distribution for all 16 processes. The table is truncated at the 128KB to 256KB interval because none of the processes performed a write() syscall larger than 256KB.
Table 3 – Write syscall Interval Statistics
| Process | 0KB < < 1KB | 1KB < < 8KB | 8KB < < 32KB | 32KB < < 128KB | 128KB < < 256KB |
|---------|-------------|-------------|--------------|----------------|-----------------|
| 12419 | 12,999 | 61,049 | 368 | 0 | 0 |
| 12420 | 72 | 1,796 | 0 | 0 | 0 |
| 12421 | 72 | 1,808 | 0 | 0 | 0 |
| 12422 | 72 | 1,806 | 0 | 0 | 0 |
| 12423 | 72 | 1,808 | 0 | 0 | 0 |
| 12424 | 72 | 1,802 | 0 | 0 | 0 |
| 12425 | 72 | 1,796 | 0 | 0 | 0 |
| 12427 | 72 | 1,796 | 0 | 0 | 0 |
| 14297 | 72 | 1,796 | 0 | 0 | 0 |
| 14298 | 72 | 1,796 | 0 | 0 | 0 |
| 14299 | 72 | 1,806 | 0 | 0 | 0 |
| 14300 | 72 | 1,808 | 0 | 0 | 0 |
| 14301 | 72 | 1,806 | 0 | 0 | 0 |
| 14302 | 72 | 1,810 | 0 | 0 | 0 |
| 14303 | 72 | 1,796 | 0 | 0 | 0 |
| 14304 | 72 | 1,796 | 0 | 0 | 0 |
This table further supports the idea that process 12419 is the rank-0 process doing all the IO for the application. Process 12419 performs an enormous number of write() syscalls smaller than 1KB relative to the other processes, and the same is true for the other write syscall size intervals.
The second section of the write output group is summarized in Table 4 below.
Table 4 – Write Statistical Summary
| Process | Total Bytes Written | Mean Bytes per syscall | Standard Deviation (bytes) | Mean Absolute Deviation (bytes) | Median Bytes per syscall | Median Absolute Deviation (bytes) | Slowest Write Time (secs) |
|---|---|---|---|---|---|---|---|
| 12419 | 134,122,801 | 1,804.76 | 1,137.07 | 1,016.65 | 2,048 | 473.42 | 0.0502 |
| 12420 | 7,362,795 | 3,941.54 | 771.62 | 3,222.57 | 4,096 | 154.46 | 0.0012 |
| 12421 | 7,411,947 | 3,942.53 | 769.25 | 3,225.41 | 4,096 | 153.48 | 0.0024 |
| 12422 | 7,403,755 | 3,942.36 | 769.64 | 3,224.94 | 4,096 | 153.54 | 0.0020 |
| 12423 | 7,411,947 | 3,942.53 | 769.25 | 3,225.41 | 4,096 | 153.48 | 0.0013 |
| 12424 | 7,387,371 | 3,942.03 | 770.43 | 3,223.99 | 4,096 | 153.97 | 0.0018 |
| 12425 | 7,362,795 | 3,941.54 | 771.62 | 3,222.57 | 4,096 | 154.46 | 0.0012 |
| 12427 | 7,362,795 | 3,941.54 | 771.62 | 3,222.57 | 4,096 | 154.46 | 0.0020 |
| 14297 | 7,362,790 | 3,941.54 | 771.63 | 3,222.56 | 4,096 | 154.46 | 0.0016 |
| 14298 | 7,362,789 | 3,941.54 | 771.64 | 3,222.46 | 4,096 | 154.46 | 0.0503 |
| 14299 | 7,403,757 | 3,942.36 | 769.64 | 3,224.94 | 4,096 | 153.64 | 0.0014 |
| 14300 | 7,411,947 | 3,942.36 | 769.54 | 3,225.41 | 4,096 | 153.48 | 0.0017 |
| 14301 | 7,403,748 | 3,942.36 | 769.66 | 3,224.92 | 4,096 | 153.64 | 0.0017 |
| 14302 | 7,420,139 | 3,942.69 | 768.86 | 3,225.99 | 4,096 | 153.31 | 0.0012 |
| 14303 | 7,362,792 | 3,941.54 | 771.64 | 3,222.56 | 4,096 | 154.46 | 0.0016 |
| 14304 | 7,362,792 | 3,941.54 | 771.64 | 3,222.56 | 4,096 | 154.46 | 0.0016 |
Again we can see that process 12419 is likely the rank-0 process since it writes roughly 18 times more data than any other process. What is also interesting is that its average (mean) number of bytes per write() syscall is approximately 1.8KB. Many people expect HPC applications to do large writes, but this particular application does a great deal of very small writes, with the average being just 1.8KB. Moreover, the median number of bytes per write() syscall is only 2,048 bytes (2 KiB).
The next output group is very similar to the write output group but covers read() syscalls. A sample of the output from strace_analyzer for this output group for process 12419 is below.
---------------------
-- Read Statistics --
---------------------
-- File sizes for read syscalls() --
IO Size Range Number of syscalls
=======================================================
( 1) 0KB < < 1KB 3819
( 2) 1KB < < 8KB 2523
( 3) 8KB < < 32KB 7
( 4) 32KB < < 128KB 0
( 5) 128KB < < 256KB 0
( 6) 256KB < < 512KB 0
( 7) 512KB < < 1000KB 0
( 8) 1000KB < < 10MB 0
( 9) 10MB < < 100MB 0
( 10) 100MB < < 1GB 0
( 11) 1GB < < 10GB 0
( 12) 10GB < < 100GB 0
( 13) 100GB < < 1TB 0
( 14) 1TB < < 10TB 0
-- READ SUMMARY --
Total number of Bytes read = 10,240,234 (10.240234 MB)
Number of Read syscalls = 6,349
Average (mean) Bytes per syscall = 1,612.89 (bytes) (0.00161289 MB)
Standard Deviation: 1,842.22 (bytes) (0.00184222 MB)
Mean Absolute Deviation: 1,761.42 (bytes) (0.00176142 MB)
Median Bytes per syscall = 4,096 (bytes) (0.004096 MB)
Median Absolute Deviation: 2,506.547 (bytes) (0.002506547 MB)
Time for slowest read syscall (secs) = 0.000226
Line location in file: 2382
Smallest read syscall size: 0 (Bytes)
Largest read syscall size: 14336 (Bytes)
As you can see this group looks very similar to the write() output group.
Table 5 below presents the read() syscall interval "size" summary, which lists the number of read() syscalls that read data within each interval. As with the write syscall summary, the table omits intervals above 256KB because all of those counts are zero.
Table 5 - Read syscall Interval Statistics
| Process | 0KB < < 1KB | 1KB < < 8KB | 8KB < < 32KB | 32KB < < 128KB | 128KB < < 256KB |
|---------|-------------|-------------|--------------|----------------|-----------------|
| 12419 | 3,819 | 2,523 | 7 | 0 | 0 |
| 12420 | 227 | 1,357 | 0 | 0 | 0 |
| 12421 | 227 | 1,354 | 0 | 0 | 0 |
| 12422 | 227 | 1,352 | 0 | 0 | 0 |
| 12423 | 227 | 1,354 | 0 | 0 | 0 |
| 12424 | 227 | 1,350 | 0 | 0 | 0 |
| 12425 | 227 | 1,354 | 0 | 0 | 0 |
| 12427 | 227 | 1,356 | 0 | 0 | 0 |
| 14297 | 232 | 1,356 | 0 | 0 | 0 |
| 14298 | 232 | 1,354 | 0 | 0 | 0 |
| 14299 | 232 | 1,354 | 0 | 0 | 0 |
| 14300 | 232 | 1,355 | 0 | 0 | 0 |
| 14301 | 232 | 1,353 | 0 | 0 | 0 |
| 14302 | 232 | 1,356 | 0 | 0 | 0 |
| 14303 | 232 | 1,356 | 0 | 0 | 0 |
| 14304 | 232 | 1,354 | 0 | 0 | 0 |
The suspected rank-0 process performs about 17 times more read() syscalls in the 0KB to 1KB range than the other processes. What is also interesting is that the other processes still perform a fair number of read() syscalls even though they don't read or write much data.
Table 6 below lists the statistical read() syscall summary for all 16 processes. It is very similar to Table 4 but for read() syscalls.
Table 6 - Read Statistics Summary
| Process | Total Bytes Read | Mean Bytes per syscall | Standard Deviation (bytes) | Mean Absolute Deviation (bytes) | Median Bytes per syscall | Median Absolute Deviation (bytes) | Slowest Read Time (secs) |
|---|---|---|---|---|---|---|---|
| 12419 | 10,240,234 | 1,612.89 | 1,842.22 | 1,761.42 | 4,096 | 2,506.547 | 0.0002 |
| 12420 | 5,611,044 | 3,542.33 | 1,349.55 | 2,500.97 | 4,096 | 553.97 | 0.0002 |
| 12421 | 5,598,755 | 3,541.27 | 1,350.61 | 2,499.74 | 4,096 | 554.73 | 0.0134 |
| 12422 | 5,590,563 | 3,540.57 | 1,351.33 | 2,498.94 | 4,096 | 555.43 | 0.0138 |
| 12423 | 5,598,756 | 3,541.28 | 1,350.61 | 2,499.75 | 4,096 | 554.72 | 0.0001 |
| 12424 | 5,582,372 | 3,539.87 | 1,352.04 | 2,498.12 | 4,096 | 556.13 | 0.0135 |
| 12425 | 5,598,756 | 3,541.28 | 1,350.61 | 2,499.75 | 4,096 | 554.72 | 0.0135 |
| 12427 | 5,606,948 | 3,541.98 | 1,349.90 | 2,500.57 | 4,096 | 554.02 | 0.0135 |
| 14297 | 5,607,058 | 3,530.89 | 1,362.13 | 2,488.24 | 4,096 | 565.11 | 0.0423 |
| 14298 | 5,598,865 | 3,530.18 | 1,362.84 | 2,487.43 | 4,096 | 565.82 | 0.1441 |
| 14299 | 5,598,866 | 3,530.18 | 1,362.84 | 2,487.43 | 4,096 | 565.82 | 0.1441 |
| 14300 | 5,602,962 | 3,530.54 | 1,362.48 | 2,487.83 | 4,096 | 565.46 | 0.0667 |
| 14301 | 5,594,769 | 3,529.82 | 1,363.19 | 2,487.02 | 4,096 | 566.18 | 0.0446 |
| 14302 | 5,607,058 | 3,530.89 | 1,362.13 | 2,488.24 | 4,096 | 565.11 | 0.0554 |
| 14303 | 5,607,058 | 3,530.89 | 1,362.13 | 2,488.24 | 4,096 | 565.11 | 0.0618 |
| 14304 | 5,598,866 | 3,530.18 | 1,362.84 | 2,487.43 | 4,096 | 565.82 | 0.0197 |
The suspected rank-0 process, process 12419, reads about twice as much data in total as any of the other processes. What is also interesting is that its average (mean) bytes per read() syscall is about 1.6KB, while for the other processes it's about 3.5KB.
The next three output groups (close statistics, open statistics, and lseek statistics) are usually not very interesting when using the "-short" option. The strace_analyzer output for these groups for process 12419 is presented below.
----------------------
-- Close Statistics --
----------------------
-- CLOSE SUMMARY --
Total number of close syscalls = 297
Average time for close syscalls (secs) = 0.000006
Maximum Time for close syscalls (secs) = 0.024903
Line location in file: 166719
---------------------
-- Open Statistics --
---------------------
-- OPEN SUMMARY --
Total number of open syscalls = 287
Average time for open syscalls (secs) = 0.000000
Maximum Time for open syscalls (secs) = 0.000039
Line location in file: 2351
------------------------------------
-- lseek unit activity Statistics --
------------------------------------
unit Number of lseeks
========================
4 1
7 2
8 1
9 464
10 5
11 2706
12 1
14 16506
15 3
For both open() and close() syscalls the summary lists the total number of the respective syscalls, the average time per syscall, the time for the slowest open() or close() syscall, and the line in the strace file where that slowest syscall occurs.
The lseek summary is a little different. It lists the number of lseek() calls by file descriptor (fd). In the brief table the file descriptor is listed as a "unit" (I guess that shows my Fortran roots).
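If you want to produce a similar tally yourself, a minimal sketch follows. The parsing assumes the line format strace produces with the -ttt and -T options, and it is my own illustration rather than the strace_analyzer source:

# Sketch: count lseek() syscalls per file descriptor from an strace log.
# Assumes lines such as "1282304806.748011 lseek(14, 4096, SEEK_SET) = 4096 <0.000008>".
import re
from collections import Counter

LSEEK_RE = re.compile(r'\blseek\((\d+),')

def lseeks_per_fd(strace_file):
    counts = Counter()
    with open(strace_file) as f:
        for line in f:
            m = LSEEK_RE.search(line)
            if m:
                counts[int(m.group(1))] += 1
    return counts   # e.g. {14: 16506, 11: 2706, ...}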
The next-to-last output group contains the IOPS statistics. This is a little more interesting because it examines three IOPS measures: (1) Write IOPS, (2) Read IOPS, and (3) Total IOPS. IOPS stands for Input/Output Operations per Second. The strace_analyzer computes these statistics by counting the number of write() syscalls, read() syscalls, and all of the IO syscalls it tracks within a given amount of time. For example, it computes Write IOPS by counting the number of write() syscalls in one-second intervals over the run of the application: the number of write() syscalls in the first second of the run, then the second second, then the third, and so on. The Read IOPS and Total IOPS are computed in a similar fashion.
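Here is a rough sketch of that one-second binning approach. The set of syscalls counted toward Total IOPS is my assumption (the tool counts "those that it tracks"), and the line parsing again assumes -ttt epoch timestamps; this is not the strace_analyzer implementation:

# Sketch: count write(), read(), and tracked IO syscalls in one-second buckets.
import re
from collections import Counter

LINE_RE = re.compile(r'^(\d+\.\d+)\s+(\w+)\(')
IO_CALLS = {"open", "close", "read", "write", "lseek"}   # assumed set of tracked IO syscalls

def iops(strace_file):
    write_buckets, read_buckets, total_buckets = Counter(), Counter(), Counter()
    start = None
    for line in open(strace_file):
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, call = float(m.group(1)), m.group(2)
        if start is None:
            start = ts
        sec = int(ts - start)            # which one-second interval of the run
        if call == "write":
            write_buckets[sec] += 1
        elif call == "read":
            read_buckets[sec] += 1
        if call in IO_CALLS:
            total_buckets[sec] += 1
    return write_buckets, read_buckets, total_buckets

# max(write_buckets.values()) gives the Maximum Write IOPS; the average is
# sum(write_buckets.values()) divided by the total run time in seconds.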
The IOPS output group lists the maximum Write IOPS, Read IOPS, and Total IOPS, along with the one-second interval in which each maximum occurred. It also computes the Average Write IOPS, Average Read IOPS, and Average Total IOPS by counting all of the respective IO syscalls and dividing by the total run time of the application. A sample of the IOPS output group for process 12419 is shown below:
---------------------
-- IOPS Statistics --
---------------------
Maximum Write IOPS = 4339 occurred at 423 seconds
Average Write IOPS = 2752
Maximum Read IOPS = 2652 occurred at 8 seconds
Average Read IOPS = 244
Maximum Total IOPS = 10042 occurred at 9 seconds
Average Total IOPS = 3753
*****************
Final IOPS report
Average Write IOPS = 2752
Average Read IOPS = 244
Average Total IOPS = 3753
*****************
You can see that the strace_analyzer lists the maximum IOPS (Write, Read, Total) as well as when they occurred along with the average IOPS (again for Write, Read, and Total).
A summary of the IOPS for the 16 processes is contained below in Table 7. It lists the maximum IOPS measures and when they occurred during the run as well as the average IOPS measures.
Table 7 - IOPS Statistics
| Process | Average Write IOPS | Average Read IOPS | Average Total IOPS | Max Write IOPS | Max Write IOPS Time (secs) | Max Read IOPS | Max Read IOPS Time (secs) | Max Total IOPS | Max Total IOPS Time (secs) |
|---|---|---|---|---|---|---|---|---|---|
| 12419 | 2,752 | 244 | 3,753 | 4,339 | 423 | 2,652 | 8 | 10,042 | 9 |
| 12420 | 311 | 396 | 1,213 | 898 | 10 | 1,335 | 9 | 2,676 | 9 |
| 12421 | 313 | 395 | 1,216 | 904 | 10 | 1,332 | 9 | 2,670 | 9 |
| 12422 | 313 | 394 | 1,215 | 903 | 10 | 1,330 | 9 | 2,666 | 9 |
| 12423 | 313 | 395 | 1,216 | 904 | 9 | 1,355 | 8 | 2,742 | 8 |
| 12424 | 312 | 394 | 1,213 | 901 | 10 | 1,328 | 9 | 2,662 | 9 |
| 12425 | 311 | 395 | 1,212 | 898 | 10 | 1,332 | 9 | 2,670 | 9 |
| 12427 | 311 | 395 | 1,213 | 898 | 10 | 1,334 | 9 | 2,674 | 9 |
| 14297 | 373 | 529 | 1,216 | 898 | 417 | 1,334 | 2 | 36,602 | 2 |
| 14298 | 467 | 528 | 1,458 | 899 | 2 | 1,332 | 2 | 1,458 | 2 |
| 14299 | 313 | 317 | 914 | 903 | 424 | 1,332 | 9 | 3,972 | 9 |
| 14300 | 313 | 396 | 1,219 | 904 | 9 | 1,333 | 8 | 2,672 | 8 |
| 14301 | 469 | 528 | 1,462 | 904 | 2 | 1,331 | 2 | 4,474 | 2 |
| 14302 | 376 | 397 | 1,220 | 906 | 8 | 1,334 | 8 | 4,484 | 8 |
| 14303 | 373 | 539 | 1,459 | 898 | 3 | 1,352 | 2 | 2,724 | 2 |
| 14304 | 467 | 528 | 1,458 | 899 | 2 | 1,332 | 2 | 4,466 | 2 |
The largest maximum Write IOPS and Read IOPS, as well as the highest average Write IOPS and average Total IOPS, all belong to process 12419, the rank-0 process.
The final output group is a per-file statistics section. It lists the distinct files opened by the application that are not associated with communication (such as sockets), along with the number of bytes written, the average write throughput, the number of bytes read, and the average read throughput. A sample of this output group is shown below (sorry for the width, but strace_analyzer writes out the entire path for each file since that can be important).
Filename Read Bytes Avg. Bytes/sec Write Bytes Avg. Bytes/sec
=========================================================================================================================
/application/Simulia/benchmark/e5_dellamd_8core.prt 0 134,350,403.33 3,079,008 384,279,539.18
/dev/infiniband/uverbs0 0 71,436,679.29 5,260 1,304,000.23
/sys/class/infiniband_verbs/uverbs0/abi_version 2 816,798.94 0 0
/proc/cpuinfo 6,248 1,239,084.77 0 0
/etc/protocols 4,096 143,908,378.00 0 0
/application/Simulia/benchmark/e5_dellamd_8core.use.1 0 292,571,428.57 4,060 11,938,059.90
/application/Simulia/benchmark/e5_dellamd_8core.pac.1 1,880,064 478,023,617.34 0 0
/application/Simulia/benchmark/e5_dellamd_8core.sel.1 12,288 379,259,259.26 8,192 463,238,095.24
/application/Simulia/benchmark/e5_dellamd_8core.abq.1 3,653,632 763,882,173.43 7,430,144 398,649,798.81
/application/Simulia/benchmark/e5_dellamd_8core.sta 0 0 4,440 11,999,119.00
/application/Simulia/benchmark/e5_dellamd_8core.msg.1 0 0 177 6,321,428.57
/application/Simulia/benchmark/e5_dellamd_8core.odb 1,473,576 82,406,285.85 123,591,248 253,690,560.35
/proc/12441/status 765 725,553,148.58 0 0
From this listing we can tell that most of the reading takes place in three files:
- /application/Simulia/benchmark/e5_dellamd_8core.pac.1 (1.88 MB)
- /application/Simulia/benchmark/e5_dellamd_8core.abq.1 (3.65 MB)
- /application/Simulia/benchmark/e5_dellamd_8core.odb (1.47 MB)
But most of the writing takes place in these files:
- /application/Simulia/benchmark/e5_dellamd_8core.odb (123.59 MB)
- /application/Simulia/benchmark/e5_dellamd_8core.abq.1 (7.43 MB)
- /application/Simulia/benchmark/e5_dellamd_8core.prt (3.08 MB)
The /application directory is NFS mounted on all of the nodes. Hence the vast majority of the IO takes place over NFS.
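It is not hard to see how per-file numbers like these can be derived from an strace log: open() lines associate a file descriptor with a path, and the return values of read() and write() on that descriptor can then be summed per path. Here is a minimal, simplified sketch of that idea (my own illustration that ignores details such as file-descriptor reuse; it is not the strace_analyzer source):

# Sketch: accumulate read/write byte counts per file from an strace log.
import re
from collections import defaultdict

OPEN_RE = re.compile(r'\bopen\("([^"]+)",.*\)\s+=\s+(\d+)')
RW_RE   = re.compile(r'\b(read|write)\((\d+),.*\)\s+=\s+(\d+)')

def per_file_bytes(strace_file):
    fd_to_path = {}
    read_bytes = defaultdict(int)
    write_bytes = defaultdict(int)
    for line in open(strace_file):
        m = OPEN_RE.search(line)
        if m:
            fd_to_path[int(m.group(2))] = m.group(1)   # remember which path this fd refers to
            continue
        m = RW_RE.search(line)
        if m:
            call, fd, nbytes = m.group(1), int(m.group(2)), int(m.group(3))
            path = fd_to_path.get(fd, "fd %d" % fd)
            (read_bytes if call == "read" else write_bytes)[path] += nbytes
    return read_bytes, write_bytes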
The strace_analyzer also produces a large number of files for plotting if you use the "-dat" option. This information can be very useful because it gives you an idea of what the application is doing as a function of time. Since it's fairly obvious that process 12419 is the rank-0 process and does virtually all of the IO for the application, and since we want to keep this article shorter than a novel, I will plot only the results for process 12419.
The first plot I like to examine is the distribution of write() syscalls. Figure 1 below plots the size of the write() syscalls on the x-axis and the number of write() syscalls for that size on the y-axis.

Figure 1: Write syscall size distribution
This plot supplements the table in the write output group. You can see that the vast majority of the write() syscalls are below 4,000 bytes, with the peak occurring at around 2,000 bytes (actually 2,048 bytes).
People usually think HPC applications do large streaming writes, but this plot shows that this HPC application does all of its writes in very small increments (2,048 bytes). If you couple this with the knowledge that there are a large number of lseek syscalls, it's fairly obvious that this application does not do large streaming writes.
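If you want to reproduce a plot like Figure 1 yourself, here is a minimal sketch (my own illustration). It reads a per-process strace file directly rather than the "-dat" output; the name file.out.12419 is just a placeholder for the per-process file that "strace -ff -o file.out" would produce:

# Sketch: histogram the write() syscall sizes for one process with matplotlib.
import re
import matplotlib.pyplot as plt

WRITE_RE = re.compile(r'\bwrite\(\d+,.*\)\s+=\s+(\d+)')

# Collect the byte count returned by every write() syscall in the trace.
sizes = [int(m.group(1)) for line in open("file.out.12419")   # placeholder file name
         if (m := WRITE_RE.search(line))]

plt.hist(sizes, bins=100)
plt.xlabel("write() syscall size (bytes)")
plt.ylabel("Number of write() syscalls")
plt.title("Write syscall size distribution")
plt.show()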
We can create the same plot for read syscalls as in Figure 2 below.

Figure 2: Read syscall size distribution
As with the write syscalls, the read syscalls are all very small (below about 4,000 bytes). Many of them are smaller than 2,000 bytes, but a large number of read syscalls sit right around 4,000 bytes (actually 4,096 bytes).
After examining the distribution of write and read syscall sizes, I tend to plot the time history of the application write() and read() syscalls. Actually I like to plot the write syscall size vs. time above a plot of the Write IOPS vs. time. This plot allows me to look for increased IOPS activity relative to the size of the data in the write syscall. Figure 3 below is this plot for process 12419.

Figure 3: Write syscall size time history (top) and Write IOPS time history (bottom)
We can see that at regular intervals the application wrote small amounts of data (less than 2,000 bytes per syscall) but it also wrote larger amounts of data in syscalls that are just a bit larger than 14,000 bytes. These intervals are correlated with increased Write IOPS. The peak Write IOPS occurs near the end of the run but there is a secondary peak near the beginning (could be writing the initial data to the output file).
After examining the write syscall sizes, I also like to examine the write throughput vs. time relative to the Write IOPS. This type of plot can help explain the IO behavior a bit more. For example, if there is a great deal of throughput but few IOPS, then the application may be doing large streaming writes (checking the previous plot will also help determine that). Or the plot can help explain why you are getting a large Write IOPS value without much throughput, perhaps because the application does small writes with a large number of lseeks in between. Figure 4 plots the write throughput time history in MB/s on top and the Write IOPS time history on the bottom.

Figure 4: Write throughput time history in MB/s (top) and Write IOPS time history (bottom).
The write throughput is correlated to the Write IOPS and the peaks occur at regular intervals. The peaks never exceed 1.8GB/s which seems very high for a single disk using NFS over GigE but you have to remember that there are data caches at multiple levels in the system. The IO scheduler has a data cache, the file system has a cache, and the drives have a cache. So the 1.8GB/s may not be the actual performance to the drive. However, I am interested in the performance the application is experiencing. It is very important to note that this plot will be dependent upon the storage system, OS, distro, etc., that you use for running the actual application. This means that this plot will change if you run the application on a different system.
I also plot the cumulative write syscall size as a function of time. In general this plot is useful because it can tell you when most of the write IO occurs. Figure 5 below plots the cumulative write syscall size in MB versus time. Please note that this is the cumulative data for all files used in the application. So it may not match the size of the files.

Figure 5: Cumulative Write syscall history.
The application is writing small amounts of data at periodic intervals. The amount of data is fairly small but you can see that it happens at regular intervals. These regular intervals make sense since the application is simulating the time history of a dynamic event. So it will write out the state of the object being simulated at fixed time intervals.
We can do the same thing for reads as we just did for writes. The same plot as Figure 3 can be created for read syscalls. This is shown below in Figure 6.

Figure 6: Read syscall and IOPS history.
This plot is a little different than the write plot (Figure 3). The application does most of the reading at the beginning of the run. This makes sense since the application has to load the input data before running. After the first few seconds of the application's run time, we can see that it does small reads at periodic intervals, closely matching the write intervals. These intervals also match the increased Read IOPS, although the values are very small (the maximum Read IOPS occurs at the beginning as one would expect).
The read throughput and Read IOPS are shown in Figure 7 below.

Figure 7: Read throughput and IOPS history.
The peak read throughput occurs near the beginning of the run, as expected, and is about 1.4 GB/s. Again, we used a single drive over NFS with GigE, so 1.4 GB/s is unreasonably high, indicating that caching effects are again involved (perhaps from the IO scheduler). After the initial read, the application reads at specified intervals. These intervals correspond to small jumps in the Read IOPS, as shown in the bottom plot of Figure 7.
The cumulative read plot is shown in Figure 8. This plot is the cumulative read total for all files.

Figure 8: Cumulative Read syscall history.
From the plot we can see that almost all of the reading happens at the beginning of the application's run. The reads that happen at regular intervals, as shown in Figures 6 and 7, also show up here, but they don't add much to the total number of bytes read.
The strace_analyzer also produces additional data files for plotting the file offset. Each data file contains the location of the file pointer relative to the start of the file in bytes (the file offset) as a function of time. This type of plot is very useful for examining the access pattern of the application. Does the application access the data sequentially? Or does it do something different, such as reading backwards and then writing ahead?
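Conceptually, the offset history can be reconstructed from the strace log itself: lseek() returns the new absolute offset, while read() and write() advance the offset by the number of bytes they return. Here is a minimal sketch of that idea (my own illustration, assuming -ttt timestamps), not the strace_analyzer implementation:

# Sketch: reconstruct the file-offset history for one file descriptor.
import re

SYSCALL_RE = re.compile(r'^(\d+\.\d+)\s+(lseek|read|write)\((\d+),.*\)\s+=\s+(\d+)')

def offset_history(strace_file, fd):
    history = []           # list of (timestamp, offset) points
    offset = 0
    for line in open(strace_file):
        m = SYSCALL_RE.match(line)
        if not m or int(m.group(3)) != fd:
            continue
        ts, call, ret = float(m.group(1)), m.group(2), int(m.group(4))
        # lseek() returns the new absolute offset; read()/write() advance it.
        offset = ret if call == "lseek" else offset + ret
        history.append((ts, offset))
    return history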
Figure 9 plots the file offset for the file e5_dellamd_8core.odb. This file is the primary output file for the application. Since the rank-0 process (12419) does the IO for the other processes, it makes sense to plot the offset for this file on process 12419.

Figure 9: File Offset for e5_dellamd_8core.odb.
What is very interesting in this figure is that at certain intervals the file offset is moved. It appears that the file offset is moved to the beginning of the file via lseek() syscalls (most likely) and moved forward either through reading, writing, or lseek syscalls. So at specified intervals the application lseeks to the beginning of the file and then either reads, writes, lseeks, or some combination of them to near the end of the file where data is written to the file.
While IO is not a big part of the execution of this application, this access pattern does not help throughput. As time progresses, the size of the output file grows. Moving the file offset back to the beginning pretty much guarantees that the data in the disk cache, the VFS cache, the IO scheduler cache, and the file system cache is worthless. This can be detrimental to performance.
Summary Observations of Abaqus/Explicit
There have been a number of tables and plots of the application as well as some comments. I wanted to pull everything together to give a summary of what we learned about this application.
- The application is definitely not IO limited
- It appears that the rank-0 process does all of the IO for the application
- The application reads and writes data at certain intervals. During these intervals the application lseeks to the beginning of the file and then moves the file offset to where it writes data. These intervals correspond to increases in Write IOPS, Read IOPS, and Total IOPS.
One of the curious things I didn't comment on earlier: if the rank-0 process is doing all of the IO, why do the other processes read and write data at all? We can gain some insight by looking at the file statistics output group for one of the non-rank-0 processes. For example, here is the listing for process 14297 (again, my apologies for the width).
Filename Read Bytes Avg. Bytes/sec Write Bytes Avg. Bytes/sec
=================================================================================================================
/proc/14317/status 766 138,766,889.62 0 0
/dev/infiniband/uverbs0 0 65,677,860.45 5,260 1,332,205.04
/sys/class/infiniband_verbs/uverbs0/abi_version 2 896,428.57 0 0
/proc/cpuinfo 6,248 1,268,732.18 0 0
/application/Simulia/benchmark/e5_dellamd_8core.use.9 0 64,269,784.08 1,007 47,999,342.28
/application/Simulia/benchmark/e5_dellamd_8core.pac.9 1,884,160 495,357,157.13 0 0
/application/Simulia/benchmark/e5_dellamd_8core.sel.9 12,288 395,114,845.94 8,192 497,371,428.57
/application/Simulia/benchmark/e5_dellamd_8core.abq.9 3,612,672 687,962,921.56 7,348,224 537,920,980.74
/application/Simulia/benchmark/e5_dellamd_8core.msg.9 0 0 107 11,888,888.89
From the list of files you can see that this process is actually reading and writing files that appear to be specific to that process:
- /application/Simulia/benchmark/e5_dellamd_8core.use.9
- /application/Simulia/benchmark/e5_dellamd_8core.pac.9
- /application/Simulia/benchmark/e5_dellamd_8core.sel.9
- /application/Simulia/benchmark/e5_dellamd_8core.abq.9
- /application/Simulia/benchmark/e5_dellamd_8core.msg.9
But you also see that it reads other files such as those in the /proc file system as well as the /dev file system (related to the InfiniBand network). So while the rank-0 process does most of the IO for the application, the other processes read and write to their own files.
Another interesting observation is that the number of open() syscalls does not match the number of close() syscalls. The explanation is fairly simple: sometimes applications don't actually close the files they use.
Work in Progress Comments
The strace_analyzer is a work in progress and is being applied to other examples. Consequently, you will see lines of code commented out that have been used to debug the application (I loathe debuggers so I use print() statements). Over time these lines will disappear.
A current limitation of the application is that all the data structures containing the information from the strace file are stored in memory. So if your strace file is too large or has too many IO syscalls, you can cause the application to swap. In the future, a database version of the application will be available so that much larger strace files can be examined (the largest file in this example had about 1.2 million lines, and the system used for the analysis had 3GB of memory and did not swap).
If the strace files are getting too large you can use an additional option with strace to only capture the syscalls of interest. For example:
strace -T -ttt -e trace=open,close,read,write,lseek,lseek64,mkdir -o file.out <application>
The additional option "-e" allows you to specify which syscalls you want to track, eliminating syscalls that have nothing to do with IO and reducing the size of the strace files.
Summary
This article, while a bit long, hopefully showed you that there is a way to process large strace files to examine IO behavior of applications. A simple application called strace_analyzer was written to do precisely that - examine strace files for IO syscalls and develop statistics around those syscalls. The application is a work in progress and it's still very early, but it has proven to be useful in several projects. The application is released under GPL v2 so you are free to use it as you like. If you have patches or comments, please send them along.
The strace_analyzer was applied to an MPI based HPC application in this article to illustrate what could be done. Using the data produced by the strace_analyzer we could examine lots of details about the IO pattern of the application. Plus we were able to plot the behavior of the IO patterns over time (a picture is worth a thousand words type of thing).
One final comment about IO profiling of applications: using strace to examine the IO syscalls in an application does change the timing behavior of the application. There is an overhead associated with using strace that changes the timings of the application. Moreover, the strace files have to be written to a file system somewhere, usually separate from the one being used by the application. This too can have an impact on the application's behavior. However, strace is easy to use, comes with all distros, produces a huge amount of information, and is very useful for understanding how the application is performing IO. So it is something of a tradeoff: gaining insight into the IO behavior of the application while perhaps slightly altering its timings. My personal opinion is that this is a worthwhile trade.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Frys enjoying the coffee and waiting for sales (but never during working hours).
Comments on "IO Profiling of Applications: strace_analyzer"
I really enjoyed that read: what a useful tool. I will definitely be using this from now on. Many thanks!
Hi,
very nice tool! I have recently written a very similar tool to do IO profiling based on strace. It is a GUI application with a core written in C (for speed) and the GUI in Python. The IOprofiler has integrated histogram and access pattern plotting. However, it doesn't (yet) provide all the statistics this tool can.
Check it out here, if you are interested: http://code.google.com/p/ioapps/
It can also replay all the traces to become a standalone benchmark with the same IO workload pattern as your original application.
Jeff, I am sorry to not continue on your code, but at least the ioreplay part must have been written in C (and thus all file descriptors handling as well). I’ve also integrated some more logic to file descriptor-to-file name mappings.
Jiri Horky
Dear Jeffrey
Can I get a copy of the Python version, please?
Regards,
Yonny
> Versions and Source for Python
> Coming soon – if you need it now, please contact me.