kscript as a substitute for awk

Among other recently added features, kscript now accepts scripts as arguments. Together with its support library, this makes it possible to use kscript in an awk-like fashion. Although kscript is designed primarily for more complex, self-contained, long-term stable, installation-free micro-applications (see here or there for examples), it is still interesting to compare both tools with respect to tabular data processing.

So let’s get started. A common use case for awk is selecting columns:

# fetch some example data
# wget -O flights.tsv https://git.io/v9MjZ
# head -n 5 flights.tsv > some_flights.tsv
awk -v OFS='\t' '{print $10, $1, $12}' some_flights.tsv

## carrier	year	tailnum
## UA	2013	N14228
## UA	2013	N24211
## AA	2013	N619AA
## B6	2013	N804JB

To do the same with kscript, we can write:

kscript -t 'lines.split().select(10,1,12).print()' some_flights.tsv

## carrier	year	tailnum
## UA	2013	N14228
## UA	2013	N24211
## AA	2013	N619AA
## B6	2013	N804JB

The kscript solution implements the same functionality in Kotlin and is only slightly more verbose.

How does it work?

When a one-liner is provided as the script argument, kscript adds the following prefix header:

//DEPS com.github.holgerbrandl:kscript:1.2
import kscript.text.*
val lines = resolveArgFile(args)

The header serves two purposes. First, it imports the support methods from kscript.text. Second, it resolves the data input, which is assumed to be either an argument file or stdin, into a Sequence<String> named lines.

The resulting script will be processed like any other by kscript.
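
To make the input resolution step more concrete, here is a minimal sketch of how such a helper could work. The function name resolveLines is hypothetical and only stands in for the library's resolveArgFile, whose actual implementation may differ; the sketch assumes the first argument, if present, is the input file.

```kotlin
import java.io.File

// Hypothetical stand-in for resolveArgFile; not the library source.
// Returns the lines of the first argument file, or of stdin if no file is given.
fun resolveLines(args: Array<String>): Sequence<String> =
    if (args.isNotEmpty()) {
        File(args[0]).bufferedReader().lineSequence()   // lazily stream the file
    } else {
        generateSequence { readLine() }                 // read stdin until EOF (null)
    }

fun main(args: Array<String>) {
    // Echo the resolved input, one line at a time
    resolveLines(args).forEach { println(it) }
}
```

Because both branches return a lazy Sequence<String>, the rest of a one-liner does not need to know where the data came from.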

In the example above several other elements of the kscript support library are used:

split() - Splits the lines of an input stream into Rows. A Row is just a delegate for List<String>.
select() - Performs positive and negative column selection. Range and index syntax, and combinations of both, are supported.
print() - Joins rows and prints them to stdout.

Separator characters can optionally be provided and default (using Kotlin default parameters) to the tab delimiter.
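
The interplay of these three elements can be illustrated with a small re-implementation in plain Kotlin. This is only a sketch under assumptions: the names mirror the ones used above, but the code is not taken from the support library, it covers positive index selection only, and it uses 1-based column access to match awk's $1, $2, and so on.

```kotlin
// Illustrative sketch only; not the kscript support library source.
class Row(private val cells: List<String>) : List<String> by cells {
    // 1-based access, so row[1] is the first column (like awk's $1)
    override operator fun get(index: Int): String = cells[index - 1]
    override fun toString(): String = cells.joinToString("\t")
}

// Split each line into a Row; the separator defaults to tab
fun Sequence<String>.split(separator: String = "\t"): Sequence<Row> =
    map { Row(it.split(separator)) }

// Keep the given 1-based columns, in the order requested
fun Sequence<Row>.select(vararg columns: Int): Sequence<Row> =
    map { row -> Row(columns.map { row[it] }) }

// Join the cells of each row and write them to stdout
fun Sequence<Row>.print(separator: String = "\t") =
    forEach { println(it.joinToString(separator)) }

fun main() {
    sequenceOf("year\tmonth\tcarrier", "2013\t1\tUA")
        .split()
        .select(3, 1)
        .print()   // prints "carrier<TAB>year" and "UA<TAB>2013"
}
```

Everything stays lazy until print() is reached, so lines are processed one at a time rather than loaded into memory at once.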

Examples

Add a new column to a file:

awk '{print $1, $2, "F11-"$7}' some_flights.tsv

## year month F11-arr_time
## 2013 1 F11-830
## 2013 1 F11-850
## 2013 1 F11-923
## 2013 1 F11-1004

kscript -t 'lines.split().map { listOf(it[1], it[2], "F11-"+ it[7]) }.print()' some_flights.tsv

## year	month	F11-arr_time
## 2013	1	F11-830
## 2013	1	F11-850
## 2013	1	F11-923
## 2013	1	F11-1004

Note that kscript keeps the tab as the delimiter for the output.

To allow for an easy transition between awk and kscript.text.*, the API uses 1-based access for the columns in the input Rows. That is, the third column is selected with lines.split().map { it[3] } or select(3).

Delete a column:

awk '!($3="")' some_flights.tsv

kscript -t 'lines.split().select(-3).print()' some_flights.tsv

As pointed out in the link, the awk solution is flawed and may not work for all types of input data. There also does not seem to be a generic awk solution to this problem (cut will do it, though).
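
One reason the awk one-liner is fragile: assigning to $3 only empties the field and makes awk rebuild the record with its output separator, so an empty slot remains and tab-delimited input comes out space-delimited by default. Removing the column for real is straightforward in plain Kotlin. The following is a minimal sketch; dropColumn is a hypothetical helper for illustration, not part of the support library.

```kotlin
// Hypothetical helper: remove the column at a 1-based position from one delimited line.
fun dropColumn(line: String, column: Int, separator: String = "\t"): String =
    line.split(separator)
        .filterIndexed { index, _ -> index != column - 1 }   // list indices are 0-based
        .joinToString(separator)                              // keep the original delimiter

fun main() {
    println(dropColumn("2013\t1\t1\t517", 3))   // prints "2013<TAB>1<TAB>517"
}
```

Empty fields elsewhere in the line are preserved, because String.split keeps empty strings between consecutive separators.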

Number lines (from here):

awk '{print FNR "\t" $0}' some_flights.tsv

kscript -t 'lines.mapIndexed { num, line -> (num + 1).toString() + "\t" + line }.print()' some_flights.tsv

Delete trailing white space (spaces, tabs):

awk '{sub(/[ \t]*$/, "");print}' file.txt

kscript -t 'lines.map { it.trimEnd() }.print()' file.txt

Print the lines from a file starting at the line matching “start” until the line matching “stop”:

awk '/start/,/stop/' file.txt

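kscript -t 'lines.dropWhile { !it.contains("start") }.takeWhile { !it.contains("stop") }.print()' file.txt

Note that, unlike the awk range pattern, this variant does not print the “stop” line itself and only handles the first such block in the file.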
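
awk's range pattern prints the closing line as well and starts a new range whenever the start pattern matches again, which a dropWhile/takeWhile chain does not reproduce. A closer equivalent can be sketched with a small sequence builder. The extension name between is hypothetical and not part of the support library.

```kotlin
// Hypothetical helper mimicking awk's '/start/,/stop/' range pattern:
// emits every block from a line matching `start` through the next line matching `stop`, inclusive.
fun Sequence<String>.between(start: Regex, stop: Regex): Sequence<String> = sequence {
    var inside = false
    for (line in this@between) {
        if (!inside && start.containsMatchIn(line)) inside = true
        if (inside) {
            yield(line)
            // The stop test also runs on the line that opened the range, as in awk
            if (stop.containsMatchIn(line)) inside = false
        }
    }
}

fun main() {
    sequenceOf("a", "start 1", "b", "stop 1", "c")
        .between(Regex("start"), Regex("stop"))
        .forEach { println(it) }   // prints "start 1", "b", "stop 1"
}
```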

Print the last field in each line delimited by ‘:’:

awk -F: '{ print $NF }' file.txt

kscript -t 'lines.split(":").map { it[it.size] }.print()' file.txt

Print the record (line) number and the number of fields in that record:

awk '{print NR,"->",NF}' file.txt

kscript -t 'lines.split().mapIndexed { index, row -> "${index + 1} -> " + row.size }.print()' file.txt

As the examples show, regular Kotlin is enough to solve most awk use cases easily. Keep in mind that kscript is not meant to be just a table processor, and we pay for that generality with some extra verbosity. The verbosity could be reduced by adding more specialized support library methods if needed, but for now the API is kept explicit to improve readability.

Performance

To assess differences in runtime, we use the initial column subsetting example to process more than 330k flights:

wc -l flights.tsv
time awk '{print $10, $1, $12}' flights.tsv > /dev/null

##   336777 flights.tsv
## 
## real	0m1.778s
## user	0m1.741s
## sys	0m0.023s

time kscript -t 'lines.split().select(10,1,12).print()' flights.tsv > /dev/null

## 
## real	0m1.671s
## user	0m1.951s
## sys	0m0.379s

The two solutions do not differ significantly in runtime. However, this actually means that kscript is processing the data faster, because we lose around 350ms for the JVM startup. To illustrate that point, we redo the benchmark with 20x the data.

# moreFlights="flights.tsv flights.tsv flights.tsv flights.tsv flights.tsv"
# cat ${moreFlights} ${moreFlights} ${moreFlights} ${moreFlights} > many_flights.tsv
time awk '{print $10, $1, $12}' many_flights.tsv > /dev/null

## 
## real	0m36.126s
## user	0m35.220s
## sys	0m0.553s

time kscript -t 'lines.split().select(10,1,12).print()' many_flights.tsv > /dev/null

## 
## real	0m21.544s
## user	0m17.962s
## sys	0m4.802s

For the tested use case, kscript needs about 40% less wall-clock time than awk (21.5s vs. 36.1s). Long live the JIT compiler! :-)

Conceptual Clarity vs. Convenience

One of the core motivations for the development of kscript is the long-term stability of kscriptlets. However, by adding a prefix header that includes a versioned dependency on the kscript support API, we are condemned either to stick with the current version of the support API forever, or to hope that gradual improvements do not break existing kscript solutions. Neither option sounds appealing.

Because of that, we are still considering whether to replace or drop the support for automatic prefixing of one-liners. The more verbose solution including the prefix header would be truly self-contained (and thus long-term stable) even if we evolve the support API, but conciseness would certainly suffer a lot. See for yourself:

kscript -t 'lines.split().select(with(1..3).and(3)).print()' file.txt

vs.

kscript -t '//DEPS com.github.holgerbrandl:kscript:1.2
import kscript.text.*
val lines = resolveArgFile(args)

lines.split().select(with(1..3).and(3)).print()
'

Readability may be even better in the latter case because it is a self-contained Kotlin application.

Opinions and suggestions on this feature are welcome!

Summary

As we have discussed above, kscript can be used as a drop-in replacement for awk in situations where awk solutions would become overly clumsy. By allowing standard Kotlin to be used for little pieces of shell processing logic, we can avoid installing external dedicated tools in many situations. Although kscripts written in Kotlin are slightly more verbose than awk code, they are more readable and make it possible to express more complex data flow logic.

Whereas table streaming is certainly possible with kscript and beneficial in some situations, its true power is the handling of more complex data types, such as JSON and XML, and domain-specific data like FASTA or alignment files in bioinformatics. Because of the built-in dependency resolution in kscript, third-party libraries can easily be used in short self-contained mini-programs, which makes it possible to cover a wide range of application domains. We plan to discuss more examples in our next article.

Thanks for reading, and feel welcome to post questions or comments.