datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.
Installing
On OSX
brew install datamash
On Debian/Ubuntu
wget http://files.housegordon.org/datamash/bin/datamash_1.0.6-1_amd64.deb
sudo dpkg -i datamash_1.0.6-1_amd64.deb
On Fedora/RedHat/CentOS
wget http://files.housegordon.org/datamash/bin/datamash-1.0.6-1.el6.x86_64.rpm
sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm
Basic examples
Let’s start with simple examples to get some intuition for how the tool works.
Sum of all numbers from 1 to 20
seq 20 | datamash sum 1
210
Count, sum and get the the mean of even numbers from 1 to 20
seq 20 | paste - - | datamash sum 2 count 2 mean 2
110 10 11
Sum of the numbers divisible by 3 between 1 to 20
seq 18 | paste - - | datamash sum 3
63
In this example I had to change 20 to 18 so datamash
could calculate the sum
properly. The reason « why » is left as an exercise for the reader.
Transpose the matrix
seq 18 | paste - - - | datamash transpose
1 4 7 10 13 16
2 5 8 11 14 17
3 6 9 12 15 18
Reverse the matrix
seq 18 | paste - - - | datamash reverse
3 2 1
6 5 4
9 8 7
12 11 10
15 14 13
18 17 16
Use case: A mobile ad campaign
Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:
id
device_type
, eitheriPhone
orAndroid
clicks
number of clicks for given device type and on given datecreated_at
a date on which the event was registered
head mobile-ad-campaign.csv
id,device_type,clicks,created_at
1,Android,620,2014-08-12
2,Android,886,2014-08-11
3,iPhone,5,2014-08-13
4,Android,644,2014-08-11
There should be 1000 records in that file. We can quickly count lines using wc
command.
wc -l mobile-ad-campaign.csv
1000 mobile-ad-campaign.csv
Also we have been told that the line 17 seems broken. Let’s check that.
sed 1d mobile-ad-campaign.csv | sed -n 17,17p
17,Andr0id,308,2014-08-12
Indeed, there is Andr0id
written with 0
instead of o
. Once corrected, we
are now ready to perform a simple, ad-hoc analysis of the data.
Total number of clicks
datamash -t, -H sum 3 < mobile-ad-campaign.csv
sum(clicks)
500932
-t,
specifies,
as a separator instead of defaulttab
character.-H
removes the headers
Number of clicks per device
datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
GroupBy(device_type),sum(clicks)
Android,259455
iPhone,241477
-g 2
specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.
Number of clicks per date and the mean per date
datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv
GroupBy(created_at),sum(clicks),mean(clicks)
2014-08-10,119952,493.62962962963
2014-08-11,118397,503.8170212766
2014-08-12,115438,493.32478632479
2014-08-13,147145,512.70034843206
Top 3 days with the most of the clicks
datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv | sed 1d | sort -n -t, -k2 -r | head -n 3
2014-08-13,147145
2014-08-10,119952
2014-08-11,118397
sed 1d
removes the headersort
with-n
compares strings using numerical value-k2
specifies the field, in this case it is the sum of clicks per date; with-r
we reverse the elements to enforce descending order- finally
head -n 3
displays only three first elements.
Summary
datamash
is a simple tool that quickly allows to perform a preliminary data
analysis. It is easy to get started with and it works well for basic cases of
data analysis.
Additionally, parsing CSV files may be tricky and a caution is needed when
dealing with them, i.e. if a field delimiter may be also present within that
field (e.g. ,
character), most *nix tools (cut
, sort
, etc.) won’t be able
to handle that situation correctly. TSV format is often better because tab
stops are infrequent in texts (datamash
supports TSV by default).
My Tech Newsletter
Get emails from me about programming & web development. I usually send it once a month