datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.
Installing
On OSX
brew install datamashOn Debian/Ubuntu
wget http://files.housegordon.org/datamash/bin/datamash_1.0.6-1_amd64.debsudo dpkg -i datamash_1.0.6-1_amd64.debOn Fedora/RedHat/CentOS
wget http://files.housegordon.org/datamash/bin/datamash-1.0.6-1.el6.x86_64.rpmsudo rpm -i datamash-1.0.6-1.el6.x86_64.rpmBasic examples
Let’s start with simple examples to get some intuition for how the tool works.
Sum of all numbers from 1 to 20
seq 20 | datamash sum 1
210Count, sum and get the the mean of even numbers from 1 to 20
seq 20 | paste - - | datamash sum 2 count 2 mean 2
110 10 11Sum of the numbers divisible by 3 between 1 to 20
seq 18 | paste - - | datamash sum 3
63In this example I had to change 20 to 18 so datamash could calculate the sum
properly. The reason « why » is left as an exercise for the reader.
Transpose the matrix
seq 18 | paste - - - | datamash transpose
1 4 7 10 13 16
2 5 8 11 14 17
3 6 9 12 15 18Reverse the matrix
seq 18 | paste - - - | datamash reverse
3 2 1
6 5 4
9 8 7
12 11 10
15 14 13
18 17 16Use case: A mobile ad campaign
Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:
iddevice_type, eitheriPhoneorAndroidclicksnumber of clicks for given device type and on given datecreated_ata date on which the event was registered
head mobile-ad-campaign.csv
id,device_type,clicks,created_at
1,Android,620,2014-08-12
2,Android,886,2014-08-11
3,iPhone,5,2014-08-13
4,Android,644,2014-08-11There should be 1000 records in that file. We can quickly count lines using wc
command.
wc -l mobile-ad-campaign.csv
1000 mobile-ad-campaign.csvAlso we have been told that the line 17 seems broken. Let’s check that.
sed 1d mobile-ad-campaign.csv | sed -n 17,17p
17,Andr0id,308,2014-08-12Indeed, there is Andr0id written with 0 instead of o. Once corrected, we
are now ready to perform a simple, ad-hoc analysis of the data.
Total number of clicks
datamash -t, -H sum 3 < mobile-ad-campaign.csv
sum(clicks)
500932-t,specifies,as a separator instead of defaulttabcharacter.-Hremoves the headers
Number of clicks per device
datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv
GroupBy(device_type),sum(clicks)
Android,259455
iPhone,241477-g 2specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.
Number of clicks per date and the mean per date
datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv
GroupBy(created_at),sum(clicks),mean(clicks)
2014-08-10,119952,493.62962962963
2014-08-11,118397,503.8170212766
2014-08-12,115438,493.32478632479
2014-08-13,147145,512.70034843206Top 3 days with the most of the clicks
datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv | sed 1d | sort -n -t, -k2 -r | head -n 3
2014-08-13,147145
2014-08-10,119952
2014-08-11,118397sed 1dremoves the headersortwith-ncompares strings using numerical value-k2specifies the field, in this case it is the sum of clicks per date; with-rwe reverse the elements to enforce descending order- finally
head -n 3displays only three first elements.
Summary
datamash is a simple tool that quickly allows to perform a preliminary data
analysis. It is easy to get started with and it works well for basic cases of
data analysis.
Additionally, parsing CSV files may be tricky and a caution is needed when
dealing with them, i.e. if a field delimiter may be also present within that
field (e.g. , character), most *nix tools (cut, sort, etc.) won’t be able
to handle that situation correctly. TSV format is often better because tab
stops are infrequent in texts (datamash supports TSV by default).