F.A.Q

How to rewrite common SQL bits with krangl?

  1. select this, that from there where that >5
there.select("this", "that").filter{ it["that"] gt 5 }

Why doesn't krangl provide vectorized comparison operators?

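Kotlin lets some operators (+, -, *, !) be overloaded with arbitrary return types, so they can work element-wise on columns. Others cannot: the comparison operators (>, <, >=, <=) compile to compareTo, which must return an Int; == compiles to equals, which must return a Boolean; and && and || cannot be overloaded at all.

As a result, there is no vectorization for >, &&, ==, etc. in table formulas → Use function calls or the not-so-pretty infix alternatives gt, AND, eq, etc.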
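The restriction can be reproduced without krangl. The following is a minimal sketch in plain Kotlin; `Col` and the `AND` extension here are hypothetical stand-ins written for illustration, not krangl's actual column classes or operators.

```kotlin
// Hypothetical minimal column type (not krangl's DataCol).
class Col(val values: List<Int>) {
    // `+` may return any type, so an element-wise result is possible.
    operator fun plus(other: Col): Col =
        Col(values.zip(other.values) { a, b -> a + b })

    // `a > b` always compiles to `a.compareTo(b) > 0`, and compareTo must
    // return Int, so `>` cannot yield one Boolean per row.
    // An infix function fills the gap.
    infix fun gt(threshold: Int): List<Boolean> = values.map { it > threshold }
}

// `&&` cannot be overloaded, so an infix function is used here as well.
infix fun List<Boolean>.AND(other: List<Boolean>): List<Boolean> =
    zip(other) { a, b -> a && b }

fun main() {
    val a = Col(listOf(1, 6, 9))
    val b = Col(listOf(4, 1, 0))
    println((a + b).values)          // [5, 7, 9]
    println(a gt 5)                  // [false, true, true]
    println((a gt 5) AND (b gt 0))   // [false, true, false]
}
```

Infix functions have no special precedence relative to each other, which is why combined conditions are wrapped in parentheses.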

Can we build data science workflows with Kotlin?

First, should we? Yes, because

  • R & Python fail to be scalable & robust solutions for data science
  • Java is known for great dependency tooling & scalability
  • Java as a language is less well suited for data-science (cluttered, legacy bits)

In February 2016 Kotlin v1.0 was released. Designed with DSLs in mind, it comes with great language features such as type inference, extension functions, data classes, and default parameters, making it a perfect choice to do data science on the JVM.
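To make these features concrete, here is a small sketch in plain Kotlin that uses each of them. The `Flight` class and the `mean` extension are hypothetical names chosen for illustration; they are not part of krangl.

```kotlin
// Data class: equals, hashCode, toString and copy are generated.
// `depDelay` has a default value, so callers may omit it.
data class Flight(val carrier: String, val arrDelay: Int?, val depDelay: Int? = null)

// Extension function with a default parameter: adds `mean` to any List<Int?>.
// Returns null if values are missing and removeNA is false, or if nothing is left.
fun List<Int?>.mean(removeNA: Boolean = false): Double? {
    if (!removeNA && any { it == null }) return null
    val present = filterNotNull()
    return if (present.isEmpty()) null else present.average()
}

fun main() {
    val flights = listOf(Flight("AA", 10), Flight("UA", null, 5), Flight("DL", 50, 20))

    // Type inference: `delays` is inferred as List<Int?> without an annotation.
    val delays = flights.map { it.arrDelay }

    println(delays.mean())                  // null
    println(delays.mean(removeNA = true))   // 30.0
}
```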

How does krangl compare to R/dplyr or Python/pandas?

flights
    .groupBy("year", "month", "day")
    .select({ range("year", "day") }, { listOf("arr_delay", "dep_delay") })
    .summarize(
            "mean_arr_delay" to { it["arr_delay"].mean(removeNA = true) },
            "mean_dep_delay" to { it["dep_delay"].mean(removeNA = true) }
    )
    .filter { (it["mean_arr_delay"] gt  30)  OR  (it["mean_dep_delay"] gt  30) }

And the same snippet written in dplyr:

flights %>%
    group_by(year, month, day) %>%
    select(year:day, arr_delay, dep_delay) %>%
    summarise(
        mean_arr_delay = mean(arr_delay, na.rm = TRUE),
        mean_dep_delay = mean(dep_delay, na.rm = TRUE)
    ) %>%
    filter(mean_arr_delay > 30 | mean_dep_delay > 30)

The biggest difference is the comparison operators, which Kotlin does not allow to be overridden in a vectorized way.

And the same in pandas. {no clue, PR needed here!}

How to add column totals to a data frame?

val foo = dataFrameOf(
    "Name", "Duration", "Color")(
    "Foo", 100, "Blue",
    "Goo", 200, "Red",
    "Bar", 300, "Yellow")

val columnTotals = foo.cols.map {
    it.name to when (it) {
        is IntCol -> it.sum()
        else -> null // ignored column types
    }
}.toMap().run {
    dataFrameOf(keys)(values)
}


bindRows(foo, columnTotals).print()

How to add a column at a certain index position?

fun DataFrame.addColumnAtIndex(columnName: String, index: Int, expression: TableExpression): DataFrame {
    return addColumn(columnName) { expression(ec, ec) }
        // keep the first `index` existing columns, then the new one, then all remaining columns
        .select(names.take(index) + listOf(columnName) + names.drop(index))
}

irisData.addColumnAtIndex("foo", 1) { "krangl rocks!" }.print()
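The column reordering can be checked in isolation with plain lists. This is a sketch under the assumption that the new column should sit at position `index` with every existing column kept; `insertAt` is a hypothetical helper and the column names are made up for illustration.

```kotlin
// Hypothetical helper: the new name goes between the first `index`
// names and all the remaining ones.
fun insertAt(names: List<String>, newName: String, index: Int): List<String> =
    names.take(index) + newName + names.drop(index)

fun main() {
    val names = listOf("a", "b", "c", "d")
    println(insertAt(names, "foo", 1))   // [a, foo, b, c, d]

    // takeLast(index) would keep only the last `index` names instead,
    // so columns in between would be lost:
    println(names.take(1) + "foo" + names.takeLast(1))   // [a, foo, d]
}
```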

Further Reading?

For a first primer, see the KotlinConf 2019 slides about Data Science with Kotlin.