Mashing data at CLI with Datamash
datamash is a tool that allows to group, count elements and perform a simple statistical data analysis right from the command line.
brew install datamash
sudo dpkg -i datamash_1.0.6-1_amd64.deb
sudo rpm -i datamash-1.0.6-1.el6.x86_64.rpm
Let’s start with simple examples to get some intuition for how the tool works.
Sum of all numbers from 1 to 20
seq 20 | datamash sum 1 210
Count, sum and get the the mean of even numbers from 1 to 20
seq 20 | paste - - | datamash sum 2 count 2 mean 2 110 10 11
Sum of the numbers divisible by 3 between 1 to 20
seq 18 | paste - - | datamash sum 3 63
In this example I had to change 20 to 18 so
datamash could calculate the sum properly. The reason « why » is left as an exercise for the reader.
Transpose the matrix
seq 18 | paste - - - | datamash transpose 1 4 7 10 13 16 2 5 8 11 14 17 3 6 9 12 15 18
Reverse the matrix
seq 18 | paste - - - | datamash reverse 3 2 1 6 5 4 9 8 7 12 11 10 15 14 13 18 17 16
Use case: A mobile ad campaign
Let’s image we just got a CSV file that summarises our advertising campaign on mobile platforms in the last few days. The file has four columns:
clicksnumber of clicks for given device type and on given date
created_ata date on which the event was registered
head mobile-ad-campaign.csv id,device_type,clicks,created_at 1,Android,620,2014-08-12 2,Android,886,2014-08-11 3,iPhone,5,2014-08-13 4,Android,644,2014-08-11
There should be 1000 records in that file. We can quickly count lines using
wc -l mobile-ad-campaign.csv 1000 mobile-ad-campaign.csv
Also we have been told that the line 17 seems broken. Let’s check that.
sed 1d mobile-ad-campaign.csv | sed -n 17,17p 17,Andr0id,308,2014-08-12
Indeed, there is
Andr0id written with
0 instead of
o. Once corrected, we are now ready to perform a simple, ad-hoc analysis of the data.
Total number of clicks
datamash -t, -H sum 3 < mobile-ad-campaign.csv sum(clicks) 500932
,as a separator instead of default
-Hremoves the headers
Number of clicks per device
datamash -t, -H -s -g 2 sum 3 < mobile-ad-campaign.csv GroupBy(device_type),sum(clicks) Android,259455 iPhone,241477
-g 2specifies the field on which we will group, in this case it’s the 2nd field i.e. the device type.
Number of clicks per date and the mean per date
datamash -t, -H -s -g 4 sum 3 mean 3 < mobile-ad-campaign.csv GroupBy(created_at),sum(clicks),mean(clicks) 2014-08-10,119952,493.62962962963 2014-08-11,118397,503.8170212766 2014-08-12,115438,493.32478632479 2014-08-13,147145,512.70034843206
Top 3 days with the most of the clicks
datamash -t, -H -s -g 4 sum 3 < mobile-ad-campaign.csv | sed 1d | sort -n -t, -k2 -r | head -n 3 2014-08-13,147145 2014-08-10,119952 2014-08-11,118397
sed 1dremoves the header
-ncompares strings using numerical value
-k2specifies the field, in this case it is the sum of clicks per date; with
-rwe reverse the elements to enforce descending order
head -n 3displays only three first elements.
datamash is a simple tool that quickly allows to perform a preliminary data analysis. It is easy to get started with and it works well for basic cases of data analysis.
Additionally, parsing CSV files may be tricky and a caution is needed when dealing with them, i.e. if a field delimiter may be also present within that field (e.g.
, character), most *nix tools (
sort, etc.) won’t be able to handle that situation correctly. TSV format is often better because tab stops are infrequent in texts (
datamash supports TSV by default).