Recently I wrote a series on using Groovy scripts to clean up the tags in my music files. I developed a framework that recognized the structure of my music directory and used it to iterate over the content files. In the final article of that series, I separated this framework into a utility class that my scripts could use to process the content files.
This separate framework reminded me a lot of the way awk works. For those of you unfamiliar with awk, you might benefit from Opensource.com's eBook, A practical guide to learning awk.
I have used awk extensively since 1984, when our little company bought its first "real" computer, which ran System V Unix. For me, awk was a revelation: It had associative memory, that is, arrays indexed by strings instead of numbers. It had regular expressions built in, seemed designed to deal with data, especially in columns, and was compact and easy to learn. Finally, it was designed to work in Unix pipelines, reading its data from standard input or files and writing to output, with no ceremony required to do so; data just appeared in the input stream.
To say that awk has been an essential part of my day-to-day computing toolkit is an understatement. And yet there are a few things about how I use awk that leave me unsatisfied.
Probably the main issue is that awk is good at dealing with data presented in delimited fields but curiously not good at handling comma-separated-value (CSV) files, where a field delimiter can be embedded inside a field, provided the field is quoted. Also, regular expressions have moved on since awk was invented, and needing to remember two sets of regular expression syntax rules is not conducive to bug-free code. One set of such rules is bad enough.
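To see the CSV problem concretely, here is a quick shell sketch (the data is made up): awk's field splitting has no notion of quoting, so a comma inside a quoted field is treated as just another delimiter.

```shell
# A quoted field containing a comma confuses awk's naive splitting:
echo '"Smith, John",42' | awk -F, '{ print $1 }'
# prints the truncated field: "Smith
```

The first "field" awk reports is `"Smith`, not the full quoted value, which is exactly the situation a CSV-aware parser handles correctly.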
Because awk is a small language, it is missing some things I sometimes find useful, like a richer assortment of base types, structures, switch statements, and so on.
In contrast, Groovy has all of these good things: access to the OpenCSV library, which facilitates dealing with CSV files; Java regular expressions and great matching operators; a rich assortment of base types; classes; switch statements; and more.
What Groovy lacks is awk's simple pipeline-oriented view of data as an incoming stream and processed data as an outgoing stream.
But my music directory processing framework made me think: Maybe I can create a Groovy version of awk's "engine." That's my goal for this article.
Install Java and Groovy
Groovy is based on Java and requires a Java installation. Both a recent and decent version of Java and Groovy may be in your Linux distribution's repositories. Groovy can also be installed following the instructions on the Groovy homepage. A nice alternative for Linux users is SDKMan, which can be used to get multiple versions of Java, Groovy, and many other related tools. For this article, I'm using SDK's releases of:
- Java: version 11.0.12-open of OpenJDK 11
- Groovy: version 3.0.8
Creating awk with Groovy
The basic idea here is to encapsulate the complexities of opening a file or files for processing, splitting each line into fields, and providing access to the stream of data in three parts:
- Before any data is processed
- On each line of data
- After all data is processed
I'm not going for the general case of replacing awk with Groovy. Instead, I'm working toward my typical use case, which is:
- Use a script file rather than having the code on the command line
- Process one or more input files
- Set my default field delimiter to | and split lines read on that delimiter
- Use OpenCSV to do the splitting (what I can't do in awk)
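For comparison, my typical awk invocation looks something like this quick sketch (with made-up data); the default behavior of the engine below reproduces this pipe-delimited splitting:

```shell
# Pipe-delimited input, split with awk's -F option:
printf 'alpha|beta|gamma\n' | awk -F'|' '{ print $2 }'
# prints: beta
```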
The framework class
Here's the "awk engine" in a Groovy class:
 1 @Grab('com.opencsv:opencsv:5.6')
 2 import com.opencsv.CSVReader
 3 public class AwkEngine {
 4     // With admiration and respect for
 5     //     Alfred Aho
 6     //     Peter Weinberger
 7     //     Brian Kernighan
 8     // Thank you for the enormous value
 9     // brought to my work by the awk
10     // programming language
11     Closure onBegin
12     Closure onEachLine
13     Closure onEnd
14     private String fieldSeparator
15     private boolean isFirstLineHeader
16     private ArrayList<String> fileNameList
17     public AwkEngine(args) {
18         this.fileNameList = args
19         this.fieldSeparator = "|"
20         this.isFirstLineHeader = false
21     }
22     public AwkEngine(args, fieldSeparator) {
23         this.fileNameList = args
24         this.fieldSeparator = fieldSeparator
25         this.isFirstLineHeader = false
26     }
27     public AwkEngine(args, fieldSeparator, isFirstLineHeader) {
28         this.fileNameList = args
29         this.fieldSeparator = fieldSeparator
30         this.isFirstLineHeader = isFirstLineHeader
31     }
32     public void go() {
33         this.onBegin()
34         int recordNumber = 0
35         fileNameList.each { fileName ->
36             int fileRecordNumber = 0
37             new File(fileName).withReader { reader ->
38                 def csvReader = new CSVReader(reader,
39                     this.fieldSeparator.charAt(0))
40                 if (isFirstLineHeader) {
41                     def csvFieldNames = csvReader.readNext() as
42                         ArrayList<String>
43                     csvReader.each { fieldsByNumber ->
44                         def fieldsByName = csvFieldNames.
45                             withIndex().
46                             collectEntries { name, index ->
47                                 [name, fieldsByNumber[index]]
48                             }
49                         this.onEachLine(fieldsByName,
50                             recordNumber, fileName,
51                             fileRecordNumber)
52                         recordNumber++
53                         fileRecordNumber++
54                     }
55                 } else {
56                     csvReader.each { fieldsByNumber ->
57                         this.onEachLine(fieldsByNumber,
58                             recordNumber, fileName,
59                             fileRecordNumber)
60                         recordNumber++
61                         fileRecordNumber++
62                     }
63                 }
64             }
65         }
66         this.onEnd()
67     }
68 }
While this looks like a fair bit of code, many of the lines are continuations of longer lines that have been split (for example, normally you would combine lines 38 and 39, lines 41 and 42, and so on). Let's look at this line by line.
Line 1 uses the @Grab annotation to fetch the OpenCSV library version 5.6 from Maven Central. No XML required.

In line 2, I import OpenCSV's CSVReader class.

In line 3, just as with Java, I declare a public utility class, AwkEngine.
Lines 11-13 define the Groovy Closure instances used by the script as hooks into this class. These are "public by default," as is the case with any Groovy class, but Groovy creates the fields as private and routes external references to them through getters and setters that Groovy provides. I'll explain that further in the sample scripts below.
Lines 14-16 declare the private fields: the field separator, a flag to indicate whether the first line of a file is a header, and a list for the file names.
Lines 17-31 define three constructors. The first receives the command-line arguments. The second also receives the field separator character. The third also receives the flag indicating whether the first line is a header or not.
Lines 32-67 define the engine itself, as the go() method.
Line 33 calls the onBegin() closure (the equivalent of the awk BEGIN {} statement).
Line 34 initializes the recordNumber for the stream (the equivalent of the awk NR variable) to 0 (note that I'm doing 0-origin here rather than awk's 1-origin).
Lines 35-65 use each {} to loop over the list of files to be processed.
Line 36 initializes the fileRecordNumber for the file (the equivalent of the awk FNR variable) to 0 (0-origin, not 1-origin).
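The relationship between the two counters mirrors awk's NR and FNR, which you can see by running awk over two files (a quick sketch using throwaway files; remember awk's counters are 1-origin, unlike the engine's):

```shell
printf 'a\nb\n' > /tmp/nr1.txt
printf 'c\n' > /tmp/nr2.txt
# NR counts records across the whole stream; FNR restarts per file:
awk '{ print FILENAME, NR, FNR }' /tmp/nr1.txt /tmp/nr2.txt
# /tmp/nr1.txt 1 1
# /tmp/nr1.txt 2 2
# /tmp/nr2.txt 3 1
```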
Lines 37-64 get a Reader instance for the file and process it.
Lines 38-39 get a CSVReader instance.
Line 40 checks to see whether the first line is being treated as a header.
If the first line is being treated as a header, then lines 41-42 get the list of field header names from the first record.
Lines 43-54 process the rest of the records.
Lines 44-48 copy the field values into the map of name:value.
Lines 49-51 call the onEachLine() closure (the equivalent of what appears in an awk program between BEGIN {} and END {}, though no pattern can be attached to make the execution conditional), passing in the map of name:value, the stream record number, the file name, and the file record number.
Lines 52-53 increment the stream record number and file record number.
Otherwise:
Lines 56-62 process the records.
Lines 57-59 call the onEachLine() closure, passing in the array of field values, the stream record number, the file name, and the file record number.
Lines 60-61 increment the stream record number and file record number.
Line 66 calls the onEnd() closure (the equivalent of the awk END {} statement).
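Taken together, the three hooks reproduce the overall shape of an awk program, as this sketch with throwaway input shows:

```shell
printf 'x\ny\nz\n' | awk '
    BEGIN { print "in begin" }              # corresponds to onBegin
    { lineCount++ }                         # corresponds to onEachLine
    END { print lineCount, "line(s) read" } # corresponds to onEnd
'
# in begin
# 3 line(s) read
```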
That's it for the framework. Now you can compile it:
$ groovyc AwkEngine.groovy
A few comments:
If an argument is passed in that is not a file, the code fails with a standard Groovy stack trace, which looks something like this:
Caught: java.io.FileNotFoundException: not-a-file (No such file or directory)
java.io.FileNotFoundException: not-a-file (No such file or directory)
at AwkEngine$_go_closure1.doCall(AwkEngine.groovy:46)
OpenCSV tends to return String[] values, which aren't as convenient as List values in Groovy (for example, there is no each {} defined for an array). Lines 41-42 convert the header field value array into a list, so perhaps fieldsByNumber in line 57 should also be converted into a list.
Using the framework in scripts
Here's a very simple script using AwkEngine to examine a file like /etc/group, which is colon-delimited and has no header:
 1 def ae = new AwkEngine(args, ':')
 2 int lineCount = 0
 3 ae.onBegin = {
 4     println "in begin"
 5 }
 6 ae.onEachLine = { fields, recordNumber, fileName, fileRecordNumber ->
 7     if (lineCount < 10)
 8         println "fileName $fileName fields $fields"
 9     lineCount++
10 }
11 ae.onEnd = {
12     println "in end"
13     println "$lineCount line(s) read"
14 }
15 ae.go()
Line 1 calls the two-argument constructor, passing in the argument list and the colon as delimiter.
Line 2 defines a script top-level variable, lineCount, used to record the count of lines read (note that Groovy closures don't require variables defined external to the closure to be final).
Lines 3-5 define the onBegin() closure, which just prints the string "in begin" on standard output.
Lines 6-10 define the onEachLine() closure, which prints the file name and the fields for the first 10 lines and in any case increments the line count.
Lines 11-14 define the onEnd() closure, which prints the string "in end" and the count of the number of lines read.
Line 15 runs the script using the AwkEngine.
Run this script as follows:
$ groovy Test1Awk.groovy /etc/group
in begin
fileName /etc/group fields [root, x, 0, ]
fileName /etc/group fields [daemon, x, 1, ]
fileName /etc/group fields [bin, x, 2, ]
fileName /etc/group fields [sys, x, 3, ]
fileName /etc/group fields [adm, x, 4, syslog,clh]
fileName /etc/group fields [tty, x, 5, ]
fileName /etc/group fields [disk, x, 6, ]
fileName /etc/group fields [lp, x, 7, ]
fileName /etc/group fields [mail, x, 8, ]
fileName /etc/group fields [news, x, 9, ]
in end
78 line(s) read
$
Of course, the .class files created by compiling the framework class must be on the classpath for this to work. Naturally, you could use jar to package up those class files.
I really like Groovy's support for the delegation of behavior, which requires various shenanigans in other languages. For many years Java required anonymous classes and quite a bit of extra code. Lambdas have gone a long way toward fixing this, but they still cannot refer to non-final variables outside their scope.
Here's another, more interesting script that is very reminiscent of my typical use of awk:
 1 def ae = new AwkEngine(args, ';', true)
 2 ae.onBegin = {
 3     // nothing to do here
 4 }
 5 def regionCount = [:]
 6 ae.onEachLine = { fields, recordNumber, fileName, fileRecordNumber ->
 7     regionCount[fields.REGION] =
 8         (regionCount.containsKey(fields.REGION) ?
 9         regionCount[fields.REGION] : 0) +
10         (fields.PERSONAS as Integer)
11 }
12 ae.onEnd = {
13     regionCount.each { region, population ->
14         println "Region $region population $population"
15     }
16 }
17 ae.go()
Line 1 calls the three-argument constructor, recognizing that this is a "true CSV" file with the header on the first line. Because it is a Spanish file, where the comma is used as the decimal "point," the standard delimiter is the semicolon.
Lines 2-4 define the onBegin() closure, which in this case doesn't do anything.
Line 5 defines an (empty) LinkedHashMap, which you will fill with String keys and Integer values. The data file comes from Chile's most recent census, and in this script you are calculating the number of people in each region of Chile.
Lines 6-11 process the lines in the file (there are 180,500 of them, including the header). Note that in this case, because you are defining line 1 as the CSV column headers, the fields parameter is going to be an instance of LinkedHashMap<String,String>.
Lines 7-10 increment the regionCount map, using the value in the field REGION as the key and the value in the field PERSONAS as the value. Note that, unlike awk, in Groovy you can't refer to a non-existent map entry on the right-hand side and expect a blank or zero value to materialize.
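In awk, by contrast, referencing a missing array element silently creates it with a zero value, so the equivalent accumulation needs no containsKey() guard. Here is a sketch with made-up data:

```shell
printf 'R1 10\nR2 5\nR1 3\n' |
    awk '{ count[$1] += $2 }   # missing entries spring into existence as 0
         END { for (k in count) print k, count[k] }' |
    sort
# R1 13
# R2 5
```

Groovy can approximate this convenience with `[:].withDefault { 0 }`, which supplies a default value for absent keys.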
Lines 12-16 print out the population by region.
Line 17 runs the script on the AwkEngine instance.
Run this script as follows:
$ groovy Test2Awk.groovy ~/Downloads/Censo2017/ManzanaEntidad_CSV/Censo*csv
Region 1 population 330558
Region 2 population 607534
Region 3 population 286168
Region 4 population 757586
Region 5 population 1815902
Region 6 population 914555
Region 7 population 1044950
Region 8 population 1556805
Region 16 population 480609
Region 9 population 957224
Region 10 population 828708
Region 11 population 103158
Region 12 population 166533
Region 13 population 7112808
Region 14 population 384837
Region 15 population 226068
$
That's it. For those of you who love awk and yet would like a little more, I hope you enjoy this Groovy approach.