We had a lot of CSV data to parse and slice when we built our end of year
report. We couldn’t easily parse CSV using
sed, because a simple “split
on commas” strategy quickly fell down. CSV has a surprisingly complicated
spec that (for example) allows commas inside values. Instead, we
needed something that’s built to parse CSV in all of its intricacy.
Enter CSVkit, “a suite of utilities for converting to and working with CSV”.
grep for CSV.
Let’s try out CSVkit’s tools on this list of conferences in CSV format:
name,start_date,end_date,location,url "Rubyfuza","02/06/2014","02/08/2014","Cape Town","http://www.rubyfuza.org/" "Launch","02/26/2014","02/28/2014","SF","http://events.launch.co/festival" "EmberConf","02/23/2014","03/24/2014","Portland","http://ember.com" "NSConf","03/17/2014","03/19/2014","England","http://nsconference.com" "RailsConf","04/22/2014","04/25/2014","Chicago","http://www.railsconf.com/"
csvcut to grab the name and location of each conference:
$ csvcut --columns name,location conferences.csv name,location Rubyfuza,Cape Town Launch,SF EmberConf,Portland NSConf,England RailsConf,Chicago
--columns option determines which column the command is operating on. It’s
used in nearly all of the CSVkit tools.
csvsort can sort data by a specific column. Let’s sort by name:
$ csvsort --columns name conferences.csv name,start_date,end_date,location,url EmberConf,2014-02-23,2014-03-24,Portland,http://ember.com Launch,2014-02-26,2014-02-28,SF,http://events.launch.co/festival NSConf,2014-03-17,2014-03-19,England,http://nsconference.com RailsConf,2014-04-22,2014-04-25,Chicago,http://www.railsconf.com/ Rubyfuza,2014-02-06,2014-02-08,Cape Town,http://www.rubyfuza.org/
Note that the output of
csvsort, like the output from other CSVkit tools, is
itself valid CSV. That means we can pipe it to other CSVkit tools, or take in
other tools’ input. Here are the names and locations of the conferences, sorted
$ csvcut --columns name,location conferences.csv | csvsort --columns name name,location EmberConf,Portland Launch,SF NSConf,England RailsConf,Chicago Rubyfuza,Cape Town
csvgrep filters your data to only rows where a column matches a
Let’s search the
name column for values that match the word “Conf”:
$ csvgrep --columns name --match Conf conferences.csv name,start_date,end_date,location,url EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com NSConf,03/17/2014,03/19/2014,England,http://nsconference.com RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/
We can also use regexes with the
--regex option. Let’s find conferences with
at least two capital letters in their name:
$ csvgrep --columns name --regex "[A-Z].*[A-Z]" conferences.csv name,start_date,end_date,location,url EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com NSConf,03/17/2014,03/19/2014,England,http://nsconference.com RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/
Since CSVgrep is written in Python, it interprets regexes the same way Python
does. If it works in Python, it
should work when passed to
This covers the 80% use case of CSVkit, but if there’s something else you want to do, I strongly recommend reading the CSVkit docs. They’re exceptionally well-written and have helpful real-life examples.