In common bioinformatics/scientific workflows black magic bash hacking skills are often required and used to process data. This is because these workflows tend to live in the bash-shell, and for many data processing tasks there is no actual direct tool. To make data flow nevertheless solution repositories like biostar or stackoverflow tend to suggest crazy combinations of perl, grep, awk or sed mixed with various amounts of bash. Such solutions often lack engineering quality, depend on specific platforms and versions, and tend to very cryptic even for well-trained data monkeys. More high-level solution in python or R could be used as well, but hardly ever run without additional setup efforts.

To overcome this problem, I’ve evolved a small extension tool to named kscript over the last months. It allows to easily embed Kotlin scriptlets into the shell, and ships with features like compile-jar-caching and automatic dependency resolution. Just see its github page for details and examples.

Although kscript has helped to create readable kotlin solutions for a range of data flow problems, it was still a bit tedious to ship around developed solutions in a concise and user-friendly manner. To overcome this problem, I’ve recently enhanced kscript to also allow for URLs as scriptlet sources.

Let’s do an example. Imagine you’d need to filter a list of genomic sequences saved in a fasta-formatted file by length. This is how the data may look like:

>HSBGPG Human gene for bone gla protein (BGP)
GGCAGATTCCCCCTAGACCCGCCCGCACCATGGTCAGGCATGCCCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
ATAAACAGTGCTGGAGGCTGGCGGGGCAGGCCAGCTGAGTCCTGAGCAGCAGCCCAGCGCAGCCACCGAGACACC
CTCCAGGCACCCTTCTTTCCTCTTCCCCTTGCCCTTGCCCTGACCTCCCAGCCCTATGGATGTGGGGTCCCCATC
ATCCCAGCTGCTCCCAAATAAACTCCAGAAG
>HSGLTH1 Human theta 1-globin gene
CCACTGCACTCACCGCACCCGGCCAATTTTTGTGTTTTTAGTAGAGACTAAATACCATATAGTGAACACCTAAGA
CGGGGGGCCTTGGATCCAGGGCGATTCAGAGGGCCCCGGTCGGAGCTGTCGGAGATTGAGCGCGCGCGGTCCCGG
GATCTCCGACGAGGCCCTGGACCCCCGGGCGGCGAAGCTGCGGCGCGGCGCCCCCTGGAGGCCGCGGGACCCCTG
TCAGCCCCGCGCTGCAGGCGTCGCTGGACAAGTTCCTGAGCCACGTTATCTCGGCGCTGGTTTCCGAGTACCGCT
GAACTGTGGGTGGGTGGCCGCGGGATCCCCAGGCGACCTTCCCCGTGTTTGAGTAAAGCCTCTCCCAGGAGCAGC
CTTCTTGCCGTGCTCTCTCGAGGTCAGGACGCGAGAGGAAGGCGC
>ARGH1 Transcriptional regulator
ATCCAGGGCGATTCAGAGGGCCCCGGGCCACGTTATCTCGGCGCTGGTTTCGCGCTGGTTTCGCGCTGGTTTCGC
CTGGTTTCC

For sure, there are many different ways to solve this problem. E.g. using a tool called samtools faidx with downstream filtering of its output using awk. Or by reformating the multi-line fasta into single-line via perl plus some awk and so on. BioPyton or BioPerl also do the trick, are very readable, but require installation and additional setup efforts.

To allow for 0-installation scriptlets that do their own automatic dependency resolution, kscript comes to rescue. Here’s a Kotlin solution for the filter problem from above, which we’ll work through step by step:

 //DEPS de.mpicbg.scicomp:kutils:0.4
 //KOTLIN_OPTS -J-Xmx2g
 
 import de.mpicbg.scicomp.bioinfo.openFasta
 import java.io.File
 import kotlin.system.exitProcess
 
 if (args.size != 2) {
     System.err.println("Usage: fasta_filter <fasta> <length_cutoff>")
     exitProcess(-1)
 }
 
 val fastaFile = File(args[0])
 val lengthCutoff = args[1].toInt()
 
 
 openFasta(fastaFile).
     filter { it.sequence.length >= lengthCutoff }.
     forEach { print(it.toEntryString()) }
  • First a single dependency is added that provides a Kotlin API to parse fasta-formatted data.
  • Because some of the sequences might be large we want to run with 2gb of memory.
  • Since it is supposed to be a self-contained mini-program it ships a simplistic CLI involving just an input file and the length-cutoff.
  • The implementation is straightforward: We create an iterator over the entries in the fasta-file, which is filtered by length, and the remaining entries are printed to stdout.

Because of Kotlin’s nice and concise collections API and syntax, the solution should be readable almost even to non-developers. ;-)

Previously kscript allowed to run such scriptlets either from a file or inlined into a workflow-script (see its docs for examples), or most naturally by by declaring kscript as scripting engine in the shebang line:

#!/usr/bin/env kscript

//DEPS de.mpicbg.scicomp:kutils:0.4
//KOTLIN_OPTS -J-Xmx2g
 
import de.mpicbg.scicomp.bioinfo.openFasta
// ... rest from above

However, since recently kscript also can read URLs which elevates its usage to a new level: The fasta length filter scriplet from above was deposited as a gist on github, and we can now simply write

kscript https://git.io/v1ZUY test.fasta 20 > filtered.fasta

To further increase readability and convenience, we can alias the first part

alias fasta_length_filter="kscript https://git.io/v1ZUY"

fasta_length_filter ~/test.fasta 20 > filtered.fasta

When being invoked without arguments fasta_length_filter will provide just the usage info.

Depending on the users’ preference the URL could point either to the master revision of the gist or to a particular revision for better reproducibility. Since the scriplet is versioned along with its dependencies (which are fetched via maven), this approach does not depend on API stability for the external libraries being used – a common problem which is tedious to overcome when working with python, R or perl! In contrast kscript solutions provide absolute long-term stability (within the limits of a hardly ever changing JVM and the kotlin compiler).

In this post I’ve discussed how to write versioned, bootstrapping mini-programs using kscript and Kotlin. In my next posting I’ll talk about the support APi for kscript to allow for awk-like one-liners written in Kotlin to please they eye instead of being confusing and aw(k)ful.