Mashing data at CLI with Datamash


2014-08-08

datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.

Installing

On OSX

brew install datamash

On Debian/Ubuntu

wget http://files.housegordon.org/datamash/bin/datamash_1.0.6-1_amd64.deb
sudo dpkg -i datamash_1.0.6-1_amd64.deb

On Fedora/RedHat/CentOS

wget http://files.housegordon.org/datamash/bin/datamash-1.0.6-1.el6.x86_64.rpm
sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm

Basic examples

Let’s start with simple examples to get some intuition for how the tool works.

Sum of all numbers from 1 to 20

seq 20 | datamash sum 1
210

Count, sum and get the the mean of even numbers from 1 to 20

seq 20 | paste - - | datamash sum 2 count 2 mean 2
110  10   11

Sum of the numbers divisible by 3 between 1 to 20

seq 18 | paste - - | datamash sum 3
63

In this example I had to change 20 to 18 so datamash could calculate the sum properly. The reason « why » is left as an exercise for the reader.

Transpose the matrix

seq 18 | paste - - - | datamash transpose
1  4  7  10  13  16
2  5  8  11  14  17
3  6  9  12  15  18

Reverse the matrix

seq 18 | paste - - - | datamash reverse
3   2   1
6   5   4
9   8   7
12  11  10
15  14  13
18  17  16

Use case: A mobile ad campaign

Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:

  • id
  • device_type, either iPhone or Android
  • clicks number of clicks for given device type and on given date
  • created_at a date on which the event was registered
head mobile-ad-campaign.csv
id,device_type,clicks,created_at
1,Android,620,2014-08-12
2,Android,886,2014-08-11
3,iPhone,5,2014-08-13
4,Android,644,2014-08-11

There should be 1000 records in that file. We can quickly count lines using wc command.

wc -l mobile-ad-campaign.csv
    1000 mobile-ad-campaign.csv

Also we have been told that the line 17 seems broken. Let’s check that.

sed 1d mobile-ad-campaign.csv | sed -n 17,17p
17,Andr0id,308,2014-08-12

Indeed, there is Andr0id written with 0 instead of o. Once corrected, we are now ready to perform a simple, ad-hoc analysis of the data.

Total number of clicks

datamash -t, -H sum 3 < mobile-ad-campaign.csv
sum(clicks)
500932
  • -t, specifies , as a separator instead of default tab character.
  • -H removes the headers

Number of clicks per device

datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
GroupBy(device_type),sum(clicks)
Android,259455
iPhone,241477
  • -g 2 specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.

Number of clicks per date and the mean per date

datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv
GroupBy(created_at),sum(clicks),mean(clicks)
2014-08-10,119952,493.62962962963
2014-08-11,118397,503.8170212766
2014-08-12,115438,493.32478632479
2014-08-13,147145,512.70034843206

Top 3 days with the most of the clicks

datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv  | sed 1d | sort -n -t, -k2 -r | head -n 3
2014-08-13,147145
2014-08-10,119952
2014-08-11,118397
  • sed 1d removes the header
  • sort with -n compares strings using numerical value -k2 specifies the field, in this case it is the sum of clicks per date; with -r we reverse the elements to enforce descending order
  • finally head -n 3 displays only three first elements.

Summary

datamash is a simple tool that quickly allows to perform a preliminary data analysis. It is easy to get started with and it works well for basic cases of data analysis.

Additionally, parsing CSV files may be tricky and a caution is needed when dealing with them, i.e. if a field delimiter may be also present within that field (e.g. , character), most *nix tools (cut, sort, etc.) won’t be able to handle that situation correctly. TSV format is often better because tab stops are infrequent in texts (datamash supports TSV by default).