datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.
Installing
On OSX
brew install datamash
On Debian/Ubuntu
wget http://files.housegordon.org/datamash/bin/datamash_1.0.6-1_amd64.deb
sudo dpkg -i datamash_1.0.6-1_amd64.deb
On Fedora/RedHat/CentOS
wget http://files.housegordon.org/datamash/bin/datamash-1.0.6-1.el6.x86_64.rpm
sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm
Basic examples
Let’s start with simple examples to get some intuition for how the tool works.
Sum of all numbers from 1 to 20
seq 20 | datamash sum 1
210
Count, sum and get the the mean of even numbers from 1 to 20
seq 20 | paste - - | datamash sum 2 count 2 mean 2
110 10 11
Sum of the numbers divisible by 3 between 1 to 20
seq 18 | paste - - | datamash sum 3
63
In this example I had to change 20 to 18 so datamash
could calculate the sum
properly. The reason « why » is left as an exercise for the reader.
Transpose the matrix
seq 18 | paste - - - | datamash transpose
1 4 7 10 13 16
2 5 8 11 14 17
3 6 9 12 15 18
Reverse the matrix
seq 18 | paste - - - | datamash reverse
3 2 1
6 5 4
9 8 7
12 11 10
15 14 13
18 17 16
Use case: A mobile ad campaign
Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:
id
device_type
, eitheriPhone
orAndroid
clicks
number of clicks for given device type and on given datecreated_at
a date on which the event was registered
head mobile-ad-campaign.csv
id,device_type,clicks,created_at
1,Android,620,2014-08-12
2,Android,886,2014-08-11
3,iPhone,5,2014-08-13
4,Android,644,2014-08-11
There should be 1000 records in that file. We can quickly count lines using wc
command.
wc -l mobile-ad-campaign.csv
1000 mobile-ad-campaign.csv
Also we have been told that the line 17 seems broken. Let’s check that.
sed 1d mobile-ad-campaign.csv | sed -n 17,17p
17,Andr0id,308,2014-08-12
Indeed, there is Andr0id
written with 0
instead of o
. Once corrected, we
are now ready to perform a simple, ad-hoc analysis of the data.
Total number of clicks
datamash -t, -H sum 3 < mobile-ad-campaign.csv
sum(clicks)
500932
-t,
specifies,
as a separator instead of defaulttab
character.-H
removes the headers
Number of clicks per device
datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
GroupBy(device_type),sum(clicks)
Android,259455
iPhone,241477
-g 2
specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.
Number of clicks per date and the mean per date
datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv
GroupBy(created_at),sum(clicks),mean(clicks)
2014-08-10,119952,493.62962962963
2014-08-11,118397,503.8170212766
2014-08-12,115438,493.32478632479
2014-08-13,147145,512.70034843206
Top 3 days with the most of the clicks
datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv | sed 1d | sort -n -t, -k2 -r | head -n 3
2014-08-13,147145
2014-08-10,119952
2014-08-11,118397
sed 1d
removes the headersort
with-n
compares strings using numerical value-k2
specifies the field, in this case it is the sum of clicks per date; with-r
we reverse the elements to enforce descending order- finally
head -n 3
displays only three first elements.
Summary
datamash
is a simple tool that quickly allows to perform a preliminary data
analysis. It is easy to get started with and it works well for basic cases of
data analysis.
Additionally, parsing CSV files may be tricky and a caution is needed when
dealing with them, i.e. if a field delimiter may be also present within that
field (e.g. ,
character), most *nix tools (cut
, sort
, etc.) won’t be able
to handle that situation correctly. TSV format is often better because tab
stops are infrequent in texts (datamash
supports TSV by default).