Groovy came in handy for me once again last week when I was tasked with aggregating performance data from application logs. The logs were concatenated together into one file that totalled over 2 million lines. Interspersed throughout the logs were performance timer entries. Each entry included a timestamp, a string for tagging the line as a performance timer entry, the name of the object and method called, and the number of milliseconds elapsed for the method's execution.

Performance tests were run over a two-day period under various loads. For analysis purposes we wanted to see each method's average run time under each set of conditions, so we needed a way to look at aggregated data for specified time periods. The results would be output in CSV format for easy integration with an Excel worksheet. Finally, we didn't want to spend much time creating the script, since the specific formatting of the performance timing entries meant it wouldn't really be reusable outside this project.

I reached for Groovy for a few reasons:

  • I'm comfortable with Java, and Groovy is a superset of Java.
  • It's a scripting language, so I can have code that executes outside of an object model.
  • It supports closures as well as some handy convenience methods.
  • It lets me use objects and classes as necessary.

I created a Groovy script that took four arguments:

  • Start time
  • End time
  • Input file (the log)
  • Output file
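Groovy hands a script its command-line arguments in the implicit args array, so unpacking all four is a one-liner via multiple assignment. A quick sketch (the variable names and sample values here are mine, not the script's actual ones):

```groovy
// Stand-in for the implicit args array a Groovy script receives;
// in the real script the right-hand side would just be args.toList()
def sampleArgs = ['2009-03-12 00:00', '2009-03-12 12:00', 'big.log', 'out.csv']

// Multiple assignment unpacks all four arguments at once
def (startArg, endArg, inFile, outFile) = sampleArgs
```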

The script reads through every line in the log file. Reading a file line by line with readers and input streams is straightforward enough in Java, but it's even quicker and easier in Groovy:

File file = new File(inFile) // inFile is a string file name
file.eachLine(parse) // parse is the name of a closure

Inside the closure, the implicit it variable refers to the line (a String) that was passed in:

nTimer = it.indexOf("[PERF-TIMER]") + 12 // length of "[PERF-TIMER]"

The parse closure is where all the action is. The algorithm is specific to the unique performance timer entries for this project, checking each line for a "PERF-TIMER" string, then for the timestamp. It then determines if the timestamp falls between the times specified by the input arguments, and proceeds from there.
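As a rough illustration, here's a minimal version of what such a closure might look like. The entry layout, field positions, and the totals map below are my assumptions for the sketch, not the project's actual format:

```groovy
import java.text.SimpleDateFormat

// Assumed entry layout (hypothetical):
// "2009-03-12 10:15:02 [PERF-TIMER] OrderService.submit 128"
def fmt = new SimpleDateFormat('yyyy-MM-dd HH:mm:ss')
Date startTime = fmt.parse('2009-03-12 00:00:00') // would come from the script arguments
Date endTime   = fmt.parse('2009-03-12 23:59:59')
def totals = [:]

def parse = {
    int nTimer = it.indexOf('[PERF-TIMER]')
    if (nTimer < 0) return                         // not a timer entry; skip
    Date ts = fmt.parse(it.substring(0, 19))       // leading timestamp
    if (ts.before(startTime) || ts.after(endTime)) return
    def (method, ms) = it.substring(nTimer + 13).tokenize(' ') // skip tag and space
    totals[method] = (totals[method] ?: 0L) + (ms as long)
}

parse('2009-03-12 10:15:02 [PERF-TIMER] OrderService.submit 128')
```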

Outside the closure, I defined a class to store method data:

class MethodData {
    long nCount = 0
    long msTotal = 0 // total milliseconds
    String methodName

    void add(String ms) {
        nCount++
        int i = Integer.parseInt(ms.trim())
        msTotal += i
    }

    double avg() {
        return msTotal / nCount
    }

    String toString() {
        return methodName + "," + nCount + "," + msTotal + "," + avg() + "\n"
    }
}

Back inside the closure, the method name is extracted from the performance timer entry and used as a key for a Map object. If data has already been captured for a given method, the milliseconds data is just added to the total. Otherwise, a new object is created and put into the Map:

if (map.containsKey(strMethodName)) {
    map.get(strMethodName).add(strMilliseconds) // Groovy dispatches dynamically; no cast needed
} else {
    data = new MethodData()
    data.methodName = strMethodName
    data.add(strMilliseconds)
    map.put(strMethodName, data)
}
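As an aside, newer Groovy versions (1.7 and later, if I recall correctly) can collapse the contains-key dance entirely with Map.withDefault, which creates and stores a missing entry the first time its key is read. A sketch, with a pared-down MethodData so the snippet stands alone:

```groovy
// Simplified MethodData so this snippet is self-contained
class MethodData {
    long nCount = 0
    long msTotal = 0
    String methodName
    void add(String ms) { nCount++; msTotal += Integer.parseInt(ms.trim()) }
}

// withDefault builds (and stores) the entry on first access
def map = [:].withDefault { name -> new MethodData(methodName: name) }
map['OrderService.submit'].add('128')
map['OrderService.submit'].add('64')
```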

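Once the whole file has been read, dumping the map to the output file is equally terse, since each MethodData's toString() is already one CSV row. A sketch of that final step, with stand-in data so it runs on its own (the header row is my addition):

```groovy
// Stand-ins: in the script, map holds name -> MethodData and
// outFile is the fourth command-line argument
def outFile = 'perf-summary.csv'
def map = ['OrderService.submit': 'OrderService.submit,2,192,96.0\n']

new File(outFile).withWriter { w ->
    w.write('method,count,totalMs,avgMs\n')             // header row
    map.each { name, row -> w.write(row.toString()) }   // one CSV row per method
}
```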
All this working with performance data got me thinking: how would the Groovy script itself perform?

Run from an Eclipse console on a MacBook Pro (4 GB of memory), the script went through a 260 MB log file with 2.003 million lines in just over 64 seconds. That's about 4 MB per second, or 0.03 milliseconds per record. Not bad, and certainly acceptable for my purposes.