kscript as substitute for awk
Among other recently added features, kscript now accepts scripts as arguments. Facilitated by its support library, this makes it possible to use it in an awk-like fashion. Although kscript is designed more for complex, self-contained, long-term-stable, installation-free micro-applications (see here or there for examples), it is still interesting to compare both tools with respect to tabular data processing.
So let’s get started. A common use case for awk is selecting columns:
To do the same with kscript we can do the following:
The kscript solution uses Kotlin to implement the same functionality and is only slightly more verbose.
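To make the idea of awk-style column selection concrete, here is a minimal sketch in plain Kotlin that uses only the standard library. It is not the kscript support API: selectColumns is a hypothetical helper introduced for illustration, and the sample rows are made up.

```kotlin
// Plain-Kotlin sketch of awk-style column selection (NOT the kscript API).
// selectColumns is a hypothetical helper; columns are 1-based, as in awk.
fun selectColumns(
    lines: Sequence<String>,
    vararg columns: Int,
    separator: String = "\t"
): Sequence<String> =
    lines.map { line ->
        val fields = line.split(separator)
        // Missing columns become empty strings, similar to awk's behaviour.
        columns.joinToString(separator) { col -> fields.getOrElse(col - 1) { "" } }
    }

fun main() {
    // Made-up sample rows standing in for a tab-separated input file.
    val input = sequenceOf("2013\t1\t1\tUA", "2013\t1\t2\tAA")
    // Roughly what awk -v OFS='\t' '{print $4, $1}' would do.
    selectColumns(input, 4, 1).forEach(::println)
}
```

The columns are emitted in the order requested, so reordering works the same way as listing fields in a different order in an awk print statement.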
How does it work?
When a one-liner is provided as script argument to kscript, it will add the following prefix header:
//DEPS com.github.holgerbrandl:kscript:1.2
import kscript.text.*
val lines = resolveArgFile(args)
The header serves two purposes. First, it imports the support methods from kscript.text. Second, it resolves the data input, which is assumed to be either an argument file or stdin, into a Sequence<String> named lines.
The resulting script will be processed like any other by kscript.
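The input resolution performed by the header can be pictured roughly as follows. This is a sketch under assumptions: resolveInput is a hypothetical stand-in written with the Kotlin standard library, and the actual resolveArgFile in the support library may be implemented differently.

```kotlin
import java.io.File

// Hypothetical stand-in for the header's input resolution; the real
// resolveArgFile from kscript.text may behave differently.
fun resolveInput(args: Array<String>): Sequence<String> =
    if (args.isNotEmpty()) {
        // First argument is treated as the input file and read lazily.
        File(args[0]).bufferedReader().lineSequence()
    } else {
        // No file given: fall back to stdin, reading until end of input.
        generateSequence(::readLine)
    }

fun main(args: Array<String>) {
    val lines = resolveInput(args)
    lines.forEach(::println)
}
```

Returning a Sequence rather than a List keeps the processing lazy, so large inputs can be streamed line by line instead of being loaded into memory first.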
In the example above several other elements of the kscript support library are used:
- split() - Splits the lines of an input stream into Rows. The latter are just a delegate for List<String>.
- select() - Allows to perform positive and negative column selection. Range and index syntax, and combinations of both, are supported.
- print() - Joins rows and prints them to stdout.
Separator characters can optionally be provided and default (using Kotlin default parameters) to the tab delimiter.
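The three building blocks can be approximated in a few lines of plain Kotlin. The sketch below borrows the names split, select and print from the description above, but it is an independent illustration and not the kscript.text source. The 1-based get, the joinRows helper, and the rule that positive and negative indices must not be mixed are assumptions made for this example, and range syntax is left out.

```kotlin
// Sketch only: names mirror the description above, but this is not kscript.text.
class Row(private val fields: List<String>) : List<String> by fields {
    // Assumption: 1-based access like awk's $1, $2, ...; only get is shifted.
    override fun get(index: Int): String = fields[index - 1]
}

// Split each line into a Row; tab is the default separator.
fun Sequence<String>.split(separator: String = "\t"): Sequence<Row> =
    map { Row(it.split(separator)) }

// Positive indices keep columns in the given order; negative indices drop them.
fun Sequence<Row>.select(vararg columns: Int): Sequence<Row> {
    require(columns.all { it > 0 } || columns.all { it < 0 }) {
        "this sketch does not mix positive and negative indices"
    }
    return map { row ->
        if (columns.all { it > 0 }) Row(columns.map { row[it] })
        else Row((1..row.size).filter { -it !in columns }.map { row[it] })
    }
}

// Hypothetical helper: join each Row back into one delimited line.
fun Sequence<Row>.joinRows(separator: String = "\t"): Sequence<String> =
    map { it.joinToString(separator) }

// Join rows and write them to stdout.
fun Sequence<Row>.print(separator: String = "\t") =
    joinRows(separator).forEach(::println)

fun main() {
    sequenceOf("a\tb\tc", "d\te\tf").split().select(3, 1).print()
}
```

Because the separator is an ordinary default parameter, a comma-separated input only needs split(",") and print(",") while the rest of the pipeline stays unchanged.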
Examples
Note that kscript keeps the tab as a delimiter for the output.
To allow for an easy transition between awk and kscript.text.*, the API uses 1-based array access for the columns in the input Rows. That is, the third column is selected with lines.split().map { it[3] } and select(3).
As pointed out in the link, the awk solution is flawed and may not work for all types of input data. There also does not seem to be a generic awk solution to this problem (cut will do it, though).
- Number lines (from here)
- Delete trailing white space (spaces, tabs)
- Print the lines from a file starting at the line matching “start” until the line matching “stop”:
- Print the last field in each line delimited by ‘:’
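The four tasks listed above can each be written as a small transformation over a Sequence<String> of input lines. The following is a sketch in plain Kotlin using only the standard library. The function names are illustrative and not part of kscript, and the range semantics of between are modelled on awk's /start/,/stop/ pattern as an assumption.

```kotlin
// Plain-Kotlin sketches of the four tasks; function names are illustrative only.

// Number lines: prefix each line with its 1-based line number.
fun numberLines(lines: Sequence<String>): Sequence<String> =
    lines.mapIndexed { i, line -> "${i + 1} $line" }

// Delete trailing white space (spaces and tabs).
fun trimTrailing(lines: Sequence<String>): Sequence<String> =
    lines.map { it.trimEnd(' ', '\t') }

// Emit each range from a line matching start through the next line matching stop.
fun between(lines: Sequence<String>, start: Regex, stop: Regex): Sequence<String> = sequence {
    var inside = false
    for (line in lines) {
        if (!inside && start.containsMatchIn(line)) inside = true
        if (inside) {
            yield(line)
            // The stop line itself is included before the range is closed.
            if (stop.containsMatchIn(line)) inside = false
        }
    }
}

// Print the last field of each line, using ':' as the delimiter.
fun lastField(lines: Sequence<String>, delimiter: String = ":"): Sequence<String> =
    lines.map { it.split(delimiter).last() }

fun main() {
    val input = sequenceOf("intro", "start here", "body \t", "stop now", "a:b:c")
    numberLines(input).forEach(::println)
    between(input, Regex("start"), Regex("stop")).forEach(::println)
    lastField(input).forEach(::println)
}
```

All four functions are lazy and can be chained, for example trimTrailing(between(lines, Regex("start"), Regex("stop"))).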
As shown in the examples, we can just use regular Kotlin to solve most awk use cases easily. Keep in mind that kscript is not meant to be just a table processor, which we pay for here with some extra verbosity. This verbosity could be factored out into more specialized support library methods if needed, but for now it is kept intentionally to improve readability.
Performance
To assess differences in runtime we use the initial column sub-setting example to process 300k flights:
Both solutions do not differ significantly in runtime. However, this actually means that kscript is processing the data faster, because we lose around 350ms to the JVM startup. To illustrate that point we redo the benchmark with 20x the data.
For the tested use case, kscript seems more than 30% faster than awk. Long live the JIT compiler! :-)
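One way to separate the JVM startup cost from the actual processing time is to measure inside the running process. The sketch below is an assumption-laden illustration rather than the benchmark used here: timeProcessing is a hypothetical helper, and the data is synthetic instead of the flights file.

```kotlin
import kotlin.system.measureTimeMillis

// Hypothetical helper: measures only the time spent processing lines inside
// the running JVM, so startup cost is not part of the result.
fun timeProcessing(
    lines: Sequence<String>,
    process: (Sequence<String>) -> Sequence<String>
): Pair<Int, Long> {
    var count = 0
    val millis = measureTimeMillis {
        // Sequences are lazy, so the result must be consumed to do the work.
        count = process(lines).count()
    }
    return count to millis
}

fun main() {
    // Synthetic stand-in data; the flights file is not reproduced here.
    val data = (1..300_000).asSequence().map { "$it\tUA\tJFK" }
    val (rows, millis) = timeProcessing(data) { ls -> ls.map { it.split("\t")[2] } }
    println("processed $rows rows in $millis ms (excluding JVM startup)")
}
```

Comparing this in-process figure with the wall-clock time reported by the shell gives a rough idea of how much of a short run is startup overhead.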
Conceptual Clarity vs. Convenience
One of the core motivations for the development of kscript is long-term stability of kscriptlets. However, by adding a prefix header including a versioned dependency for the kscript support API, we are somewhat condemned to either stick to the current version of the support API for all time, or to hope that gradual improvements do not break existing kscript solutions. Neither option sounds appealing.
Because of that, we are still considering replacing or dropping the support for automatic prefixing of one-liners. The more verbose solution including the prefix header would be truly self-contained (and thus long-term stable) even if we evolve the support API, but conciseness would certainly suffer a lot. See for yourself:
vs.
Readability may be even better in the latter case because it is a self-contained Kotlin application.
Opinions and suggestions on this feature are welcome!
Summary
As we have discussed above, kscript can be used as a drop-in replacement for awk in situations where awk solutions would become overly clumsy. By allowing standard Kotlin for writing little pieces of shell processing logic, we can avoid installing external dedicated tools in many situations. Although kscripts written in Kotlin are slightly more verbose than awk code, they are more readable and allow expressing more complex data flow logic.
Whereas table streaming is certainly possible with kscript and beneficial in some situations, its true power is the handling of more complex data types, such as JSON and XML, and domain-specific data like FASTA or alignment files in bioinformatics. Because of the built-in dependency resolution in kscript, third-party libraries can easily be used in short self-contained mini-programs, which allows covering a wide range of application domains. We plan to discuss more examples in our next article.
Thanks for reading, and feel welcome to post questions or comments.