We had a lot of CSV data to parse and slice when we built our end of year
report. We couldn’t easily parse CSV using sed
, because a simple “split
on commas” strategy quickly fell down. CSV has a surprisingly complicated
spec that (for example) allows commas inside values. Instead, we
needed something that’s built to parse CSV in all of its intricacy.
Enter CSVkit, “a suite of utilities for converting to and working with CSV”.
It’s cut
, sort
, and grep
for CSV.
Let’s try out CSVkit’s tools on this list of conferences in CSV format:
name,start_date,end_date,location,url
"Rubyfuza","02/06/2014","02/08/2014","Cape Town","http://www.rubyfuza.org/"
"Launch","02/26/2014","02/28/2014","SF","http://events.launch.co/festival"
"EmberConf","02/23/2014","03/24/2014","Portland","http://ember.com"
"NSConf","03/17/2014","03/19/2014","England","http://nsconference.com"
"RailsConf","04/22/2014","04/25/2014","Chicago","http://www.railsconf.com/"
csvcut
Let’s use csvcut
to grab the name and location of each conference:
$ csvcut --columns name,location conferences.csv
name,location
Rubyfuza,Cape Town
Launch,SF
EmberConf,Portland
NSConf,England
RailsConf,Chicago
The --columns
option determines which column the command is operating on. It’s
used in nearly all of the CSVkit tools.
csvsort
csvsort
can sort data by a specific column. Let’s sort by name:
$ csvsort --columns name conferences.csv
name,start_date,end_date,location,url
EmberConf,2014-02-23,2014-03-24,Portland,http://ember.com
Launch,2014-02-26,2014-02-28,SF,http://events.launch.co/festival
NSConf,2014-03-17,2014-03-19,England,http://nsconference.com
RailsConf,2014-04-22,2014-04-25,Chicago,http://www.railsconf.com/
Rubyfuza,2014-02-06,2014-02-08,Cape Town,http://www.rubyfuza.org/
Note that the output of csvsort
, like the output from other CSVkit tools, is
itself valid CSV. That means we can pipe it to other CSVkit tools, or take in
other tools’ input. Here are the names and locations of the conferences, sorted
by name:
$ csvcut --columns name,location conferences.csv | csvsort --columns name
name,location
EmberConf,Portland
Launch,SF
NSConf,England
RailsConf,Chicago
Rubyfuza,Cape Town
csvgrep
csvgrep
filters your data to only rows where a column matches a
specific value.
Let’s search the name
column for values that match the word “Conf”:
$ csvgrep --columns name --match Conf conferences.csv
name,start_date,end_date,location,url
EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com
NSConf,03/17/2014,03/19/2014,England,http://nsconference.com
RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/
We can also use regexes with the --regex
option. Let’s find conferences with
at least two capital letters in their name:
$ csvgrep --columns name --regex "[A-Z].[A-Z]" conferences.csv
name,start_date,end_date,location,url
EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com
NSConf,03/17/2014,03/19/2014,England,http://nsconference.com
RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/
Since CSVgrep is written in Python, it interprets regexes the same way Python
does. If it works in Python, it
should work when passed to --regex
.
Even more
This covers the 80% use case of CSVkit, but if there’s something else you want to do, I strongly recommend reading the CSVkit docs. They’re exceptionally well-written and have helpful real-life examples.