Add columns with addColumn
val df: DataFrame = dataFrameOf(
"first_name", "last_name", "age", "weight")(
"Max", "Doe", 23, 55,
"Franz", "Smith", 23, 88,
"Horst", "Keanes", 12, 82)
df.addColumn("salary_category") { 3 } // add constants
df.addColumn("age_3y_later") { it["age"] + 3 } // do basic column arithmetics
// krangl dataframes are immutable so we need to (re)assign results to preserve changes.
val newDF = df.addColumn("full_name") { it["first_name"] + " " + it["last_name"] }
// krangl overloads arithmetic operators like + for dataframe-columns
df.addColumn("user_id") { it["last_name"] + "_id" + rowNumber }
//and provides convenience methods to ignore NAs
df.addColumn("first_name_initial") { it["first_name"].map<String>{ it.first() } }
// or add multiple columns at once
df.addColumns(
"age_plus3" to { it["age"] + 3 },
"initial" to { it["first_name"].map<String> { it.first() } }
)
To make create columns starting with constant values those need to be expanded to static columns using with const
df.createColumn("user_id") { const("id") + nrow }
Get your data in order with sortedBy
df.sortedBy("age")
// and add secondary sorting attributes as varargs
df.sortedBy("age", "weight")
// reverse sorting order
df.sortedByDescending("age")
df.sortedBy{ desc("age") }
// sort descending by age, and resolve ties by weight
df.sortedBy({ desc(it["age"]) }, { it["weight"] })
// sort with indicator lambda
df.sortedBy { it["weight"].round() }
???
mimic Kotlin stdlib where possible
Subset variables with select
// positive selection
df.select("last_name", "weight")
// negative selection
df.remove("weight", "age")
// selector mini-language
df.select { endsWith("name") }
df.select { matches("foo[0-9") }
// functional style column selection
// odd name to avoid JVM signature clash (help welcome!)
df.select2 { it is IntCol }
// rename columns
df.rename("last_name" to "Nachname")
{% hint style="warning" %}
Be aware that the usage of string constants as function literals is legit kotlin code sleepData.sortedBy{ "order" }
but lacks select semantics. krangl
will throw an error in such a case.
{% endhint %}
Subset your records with filter
// Subset rows with vectorized filter
df.filter { it["age"] eq 23 }
df.filter { it["weight"] gt 50 }
df.filter({ it["last_name"].isMatching { startsWith("Do") }})
In case vectorized operations are not possible or available we can also filter tables by row which allows for scalar operators
df.filterByRow { it["age"] as Int > 5 }
df.filterByRow { (it["age"] as Int).rem(10) == 0 } // "round" birthdays :-)
Summarize your data with summarize
// do simple cross tabulations
df.count("age", "last_name")
// ... or calculate single summary statistic
df.summarize("mean_age") { it["age"].mean(true) }
// ... or multiple summary statistics
df.summarize(
"min_age" to { it["age"].min() },
"max_age" to { it["age"].max() }
)
// for sake of r and python transition you can also use `=` here
df.summarize(
"min_age" `=` { it["age"].min() },
"max_age" `=` { it["age"].max() }
)
Perform grouped operations after groupBy
val groupedDf: DataFrame = df.groupBy("age")
// ... or provide multiple grouping attributes with varargs
val sumDF = groupedDf.summarize(
"mean_weight" to { it["weight"].mean(removeNA = true) },
"num_persons" to { nrow }
)
// Optionally ungroup the data
sumDF.ungroup()
Bring it all together
flightsData
.groupBy("year", "month", "day")
.select({ range("year", "day") }, { listOf("arr_delay", "dep_delay") })
.summarize(
"mean_arr_delay" `=` { it["arr_delay"].mean(removeNA = true) },
"mean_dep_delay" to { it["dep_delay"].mean(removeNA = true) }
)
.filter { (it["mean_arr_delay"] gt 30) OR (it["mean_dep_delay"] gt 30) }
.sortedBy("mean_arr_delay")
year month day mean_arr_delay mean_dep_delay
2013 10 11 18.9229 31.2318
2013 5 24 24.2574 30.3407
2013 6 2 26.075 34.0133
2013 6 26 27.3174 30.61175
2013 6 10 28.0222 30.61945
2013 7 8 29.6488 37.2966
2013 8 22 29.9767 33.6004
2013 2 27 31.252 37.7632
Note
Both =
and to
are supported in table expressions.
Summarize Data with summarize
Examples
- Add a suffix to some column names
// first select column names to be altered irisData.names.filter { it.startsWith("Sepal") }.map { // second, apply renaming oldName -> irisData.rename(oldName to ("My" + oldName)) }