CSVKit Brings the Unix Philosophy to CSV

Gabe Berke-Williams

We had a lot of CSV data to parse and slice when we built our end of year report. We couldn’t easily parse CSV using sed, because a simple “split on commas” strategy quickly fell down. CSV has a surprisingly complicated spec that (for example) allows commas inside values. Instead, we needed something that’s built to parse CSV in all of its intricacy.

Enter CSVkit, “a suite of utilities for converting to and working with CSV”. It’s cut, sort, and grep for CSV.

Let’s try out CSVkit’s tools on this list of conferences in CSV format:

name,start_date,end_date,location,url
"Rubyfuza","02/06/2014","02/08/2014","Cape Town","http://www.rubyfuza.org/"
"Launch","02/26/2014","02/28/2014","SF","http://events.launch.co/festival"
"EmberConf","02/23/2014","03/24/2014","Portland","http://ember.com"
"NSConf","03/17/2014","03/19/2014","England","http://nsconference.com"
"RailsConf","04/22/2014","04/25/2014","Chicago","http://www.railsconf.com/"

csvcut

Let’s use csvcut to grab the name and location of each conference:

$ csvcut --columns name,location conferences.csv

name,location
Rubyfuza,Cape Town
Launch,SF
EmberConf,Portland
NSConf,England
RailsConf,Chicago

The --columns option determines which column the command is operating on. It’s used in nearly all of the CSVkit tools.

csvsort

csvsort can sort data by a specific column. Let’s sort by name:

$ csvsort --columns name conferences.csv

name,start_date,end_date,location,url
EmberConf,2014-02-23,2014-03-24,Portland,http://ember.com
Launch,2014-02-26,2014-02-28,SF,http://events.launch.co/festival
NSConf,2014-03-17,2014-03-19,England,http://nsconference.com
RailsConf,2014-04-22,2014-04-25,Chicago,http://www.railsconf.com/
Rubyfuza,2014-02-06,2014-02-08,Cape Town,http://www.rubyfuza.org/

Note that the output of csvsort, like the output from other CSVkit tools, is itself valid CSV. That means we can pipe it to other CSVkit tools, or take in other tools’ input. Here are the names and locations of the conferences, sorted by name:

$ csvcut --columns name,location conferences.csv | csvsort --columns name

name,location
EmberConf,Portland
Launch,SF
NSConf,England
RailsConf,Chicago
Rubyfuza,Cape Town

csvgrep

csvgrep filters your data to only rows where a column matches a specific value.

Let’s search the name column for values that match the word “Conf”:

$ csvgrep --columns name --match Conf conferences.csv

name,start_date,end_date,location,url
EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com
NSConf,03/17/2014,03/19/2014,England,http://nsconference.com
RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/

We can also use regexes with the --regex option. Let’s find conferences with at least two capital letters in their name:

$ csvgrep --columns name --regex "[A-Z].[A-Z]" conferences.csv

name,start_date,end_date,location,url
EmberConf,02/23/2014,03/24/2014,Portland,http://ember.com
NSConf,03/17/2014,03/19/2014,England,http://nsconference.com
RailsConf,04/22/2014,04/25/2014,Chicago,http://www.railsconf.com/

Since CSVgrep is written in Python, it interprets regexes the same way Python does. If it works in Python, it should work when passed to --regex.

Even more

This covers the 80% use case of CSVkit, but if there’s something else you want to do, I strongly recommend reading the CSVkit docs. They’re exceptionally well-written and have helpful real-life examples.