
Introduction


This is the manual of krangl.

krangl is an open-source {K}otlin library for data w{rangl}ing. By implementing a grammar of data manipulation with a modern functional-style API, it lets you filter, transform, aggregate, and reshape tabular data.

krangl aims to be for Kotlin what pandas is for Python, and what readr+tidyr+dplyr are for R.

krangl is open-source and developed on GitHub.

For a first primer, see the KotlinConf 2019 slides about Data Science with Kotlin.

Features

  • Filter, transform, aggregate and reshape tabular data
  • Modern, user-friendly and easy-to-learn data-science API
  • Reads plain and compressed tsv, csv, json, or any other delimited format, with or without a header, from local or remote sources
  • Supports grouped operations
  • Ships with JDBC support
  • Tables can contain atomic columns (int, double, boolean) as well as object columns
  • Reshape tables from wide to long and back
  • Table joins (left, right, semi, inner, outer)
  • Cross tabulation
  • Descriptive statistics (mean, min, max, median, ...)
  • Functional API inspired by dplyr, pandas, and Kotlin stdlib

Furthermore, it provides methods to go back and forth between untyped and typed data.
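
To make the feature list more concrete, here is a minimal sketch that builds a small table, filters it, and runs a grouped aggregation. The table contents and the helper names `samplePeople` and `meanWeightOfAdults` are made up for illustration and are not part of krangl. The sketch assumes krangl is on the classpath.

```kotlin
import krangl.*

// Hypothetical sample data: column names first, then the values row by row.
fun samplePeople(): DataFrame = dataFrameOf("first_name", "last_name", "age", "weight")(
    "Max", "Doe", 23, 55,
    "Franz", "Smith", 23, 88,
    "Horst", "Keanes", 12, 82
)

// Hypothetical helper: keep rows with age above 18, then compute the mean weight per age group.
fun meanWeightOfAdults(people: DataFrame): DataFrame = people
    .filter { it["age"] gt 18 }
    .groupBy("age")
    .summarize("mean_weight" to { it["weight"].mean(removeNA = true) })

fun main() {
    // Print the aggregated table to the console.
    meanWeightOfAdults(samplePeople()).print()
}
```

Each step returns a new data frame, so the calls can be chained or split into named intermediate values as preferred.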

Installation

To get started simply add it as a dependency:

repositories {
    mavenCentral()
}

dependencies {
    implementation "com.github.holgerbrandl:krangl:0.18.4"
}
Declaring the repository is purely optional as it is the default already.
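
For projects that use the Gradle Kotlin DSL, the same setup might look like the following sketch. It assumes a `build.gradle.kts` build script and reuses the coordinates and version shown above.

```kotlin
// build.gradle.kts (assumed file name for a Kotlin DSL build script)
repositories {
    mavenCentral()
}

dependencies {
    // Same artifact coordinates as in the Groovy snippet above
    implementation("com.github.holgerbrandl:krangl:0.18.4")
}
```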

If you're new to Kotlin and Gradle, you may first want to read about Kotlin's basic syntax, some basic IDE features, and how to use Gradle to configure dependencies in Kotlin projects.

Example

Flights that departed NYC are grouped by date, some columns of interest are selected, the data is summarized to reveal mean departure and arrival delays, and finally only those dates that show extreme delays are kept.

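// Assumes `import krangl.*` and a DataFrame named `flights` that holds the NYC flights data
flights
    // group by departure date
    .groupBy("year", "month", "day")
    // keep the date columns plus the two delay columns
    .select({ range("year", "day") }, { listOf("arr_delay", "dep_delay") })
    // compute the mean delays per day, ignoring missing values
    .summarize(
        "mean_arr_delay" to { it["arr_delay"].mean(removeNA = true) },
        "mean_dep_delay" to { it["dep_delay"].mean(removeNA = true) }
    )
    // keep only days whose mean arrival or departure delay exceeds 30
    .filter { (it["mean_arr_delay"] gt 30) OR (it["mean_dep_delay"] gt 30) }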