Mashing data at CLI with Datamash


datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.



brew install datamash

On Debian/Ubuntu

sudo dpkg -i datamash_1.0.6-1_amd64.deb

On Fedora/RedHat/CentOS

sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm

Basic examples

Let’s start with simple examples to get some intuition for how the tool works.

Sum of all numbers from 1 to 20

seq 20 | datamash sum 1

Count, sum and get the the mean of even numbers from 1 to 20

seq 20 | paste - - | datamash sum 2 count 2 mean 2
110  10   11

Sum of the numbers divisible by 3 between 1 to 20

seq 18 | paste - - | datamash sum 3

In this example I had to change 20 to 18 so datamash could calculate the sum properly. The reason « why » is left as an exercise for the reader.

Transpose the matrix

seq 18 | paste - - - | datamash transpose
1  4  7  10  13  16
2  5  8  11  14  17
3  6  9  12  15  18

Reverse the matrix

seq 18 | paste - - - | datamash reverse
3   2   1
6   5   4
9   8   7
12  11  10
15  14  13
18  17  16

Use case: A mobile ad campaign

Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:

  • id
  • device_type, either iPhone or Android
  • clicks number of clicks for given device type and on given date
  • created_at a date on which the event was registered
head mobile-ad-campaign.csv

There should be 1000 records in that file. We can quickly count lines using wc command.

wc -l mobile-ad-campaign.csv
    1000 mobile-ad-campaign.csv

Also we have been told that the line 17 seems broken. Let’s check that.

sed 1d mobile-ad-campaign.csv | sed -n 17,17p

Indeed, there is Andr0id written with 0 instead of o. Once corrected, we are now ready to perform a simple, ad-hoc analysis of the data.

Total number of clicks

datamash -t, -H sum 3 < mobile-ad-campaign.csv
  • -t, specifies , as a separator instead of default tab character.
  • -H removes the headers

Number of clicks per device

datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
  • -g 2 specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.

Number of clicks per date and the mean per date

datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv

Top 3 days with the most of the clicks

datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv  | sed 1d | sort -n -t, -k2 -r | head -n 3
  • sed 1d removes the header
  • sort with -n compares strings using numerical value -k2 specifies the field, in this case it is the sum of clicks per date; with -r we reverse the elements to enforce descending order
  • finally head -n 3 displays only three first elements.


datamash is a simple tool that quickly allows to perform a preliminary data analysis. It is easy to get started with and it works well for basic cases of data analysis.

Additionally, parsing CSV files may be tricky and a caution is needed when dealing with them, i.e. if a field delimiter may be also present within that field (e.g. , character), most *nix tools (cut, sort, etc.) won’t be able to handle that situation correctly. TSV format is often better because tab stops are infrequent in texts (datamash supports TSV by default).