Mashing data at CLI with Datamash

2014-08-08 · 3 min read

datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.



brew install datamash

On Debian/Ubuntu

sudo dpkg -i datamash_1.0.6-1_amd64.deb

On Fedora/RedHat/CentOS

sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm

Basic examples

Let’s start with simple examples to get some intuition for how the tool works.

Sum of all numbers from 1 to 20

seq 20 | datamash sum 1

Count, sum and get the the mean of even numbers from 1 to 20

seq 20 | paste - - | datamash sum 2 count 2 mean 2
110  10   11

Sum of the numbers divisible by 3 between 1 to 20

seq 18 | paste - - | datamash sum 3

In this example I had to change 20 to 18 so datamash could calculate the sum properly. The reason « why » is left as an exercise for the reader.

Transpose the matrix

seq 18 | paste - - - | datamash transpose
1  4  7  10  13  16
2  5  8  11  14  17
3  6  9  12  15  18

Reverse the matrix

seq 18 | paste - - - | datamash reverse
3   2   1
6   5   4
9   8   7
12  11  10
15  14  13
18  17  16

Use case: A mobile ad campaign

Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:

  • id
  • device_type, either iPhone or Android
  • clicks number of clicks for given device type and on given date
  • created_at a date on which the event was registered
head mobile-ad-campaign.csv

There should be 1000 records in that file. We can quickly count lines using wc command.

wc -l mobile-ad-campaign.csv
    1000 mobile-ad-campaign.csv

Also we have been told that the line 17 seems broken. Let’s check that.

sed 1d mobile-ad-campaign.csv | sed -n 17,17p

Indeed, there is Andr0id written with 0 instead of o. Once corrected, we are now ready to perform a simple, ad-hoc analysis of the data.

Total number of clicks

datamash -t, -H sum 3 < mobile-ad-campaign.csv
  • -t, specifies , as a separator instead of default tab character.
  • -H removes the headers

Number of clicks per device

datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
  • -g 2 specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.

Number of clicks per date and the mean per date

datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv

Top 3 days with the most of the clicks

datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv  | sed 1d | sort -n -t, -k2 -r | head -n 3
  • sed 1d removes the header
  • sort with -n compares strings using numerical value -k2 specifies the field, in this case it is the sum of clicks per date; with -r we reverse the elements to enforce descending order
  • finally head -n 3 displays only three first elements.


datamash is a simple tool that quickly allows to perform a preliminary data analysis. It is easy to get started with and it works well for basic cases of data analysis.

Additionally, parsing CSV files may be tricky and a caution is needed when dealing with them, i.e. if a field delimiter may be also present within that field (e.g. , character), most *nix tools (cut, sort, etc.) won’t be able to handle that situation correctly. TSV format is often better because tab stops are infrequent in texts (datamash supports TSV by default).

My Tech Newsletter

Get emails from me about programming & web development. I usually send it once a month