Over the last couple weeks I had the first real chance to take a really large dataset and explore it using tools like csvkit and PostgreSQL.
Needless to say I learned a lot, and with that learning came a bit of backtracking. But something I have learned to do as a matter of course now is to document every step of a project — whether coding or writing. The extra effort not only comes in handy when creating a workflow, but offers plenty of advice and ideas for others who are also learning. What follows are an edited/revised version of the notes from my scratchpad file.
This dataset came in 15 text files — five main files and two sets of five text files related to the main set through a common ID. In total there were some 125,000 rows. I also received a “column key” describing the name of each field, the data types for each filed and what designations stood for.
Without access to Access, and after several attempts to use LibreOffice Base to create three tables in a database and JOIN them based on common ID — crash, crash, crash — I sucked it up and learned how to use csvkit and PostgreSQL. Boy am I glad I did. And boy am I thankful for Christopher Groskopf’s csvkit and his imagination about what is possible, and the ability to progam it.
Thinking about it now reminds me of something Mark Ng said duing #NICAR12 regarding Chris: “He’s from the future”
As I mentioned, I attempted to document each step, query, failure, success and learning moment, so while this is very much a how to, it is also very long, winding and potentially confusing to some. That said, if you like journo-tech-nerdery — or what to offer some tips — please enjoy.
Getting started with some tools
csvkit is a command-line toolset to explore CSV files. It can be installed into a virtual environment which isolates it from key system packages:
sudo pip install virtualenv virtualenv venv source venv/bin/activate pip install csvkit
It can also be installed globally:
sudo pip install csvkit
If you’re beginning on the command line, on the suggestion of Joe Germuska, this is a great resource to learn from.
Preparing the data for exploration
- After importing into Google Docs failed — the dataset was too larger — I opened one of the main text files in LibreOffice to see what I was working with.
- The file came without headers so I pulled the first few rows into a new CSV file, imported that into a Google spreadsheet — easier to type in — and used the accompanying key to enter in column information.
- I then exported the header row as new file, opened each of the data files in a text editor, pasted the header to the top of the file and saved as CSV. I repeated this step for all 15 files. (There’s a time savings to be found here!)
- I fired up csvkit and ran csvclean on each of the 15 files to look for syntax errors or bad rows.
- I tried to use csvstack to combine the files, but I ran into errors. That I couldn’t decipher. So instead I manually combined each of the CSV files into one large CSV file with one row of headers. (More time savings to be found!)
- I dropped the three combined CSV files into my “data_inbox” and fired up a script based on one authored by Jeff Severns Guntzelthat:
Makes a backup of the data files.
Creates a new project directory with a specified folder structure.
Audits the CSV using csvkit’s csvstat function and creates a data-abstract file.
Moves all of the files into the new project directory.
Then I opened up my data-abstract-file to see what I had. Here’s an example of what I was able to see at a glance:
47. type_of_collision <type 'unicode'> Nulls: False Unique values: 10 5 most frequent values: B: 50090 C: 33952 E: 16700 D: 13184 F: 3913 Max length: 1
In this case, B, C, E, D and F correspond to a specific value in the column key. Thanks to some magic from Chris and Jeff, at a quick glance I was able to decipher values in every column of the data file and pass information on to the editor for this story.
You can use a csvkit function to do this on an individual column as well, substituting the column_number and file_name in the example command below.
csvstat -c **column_number** --freq **file_name**
Making the deep dive
Once I learned we wanted to focus on a specific type of record I used csvkit’s csvgrep (search) function to search the rows for that value and output the results to a new file which was only about 4,000 odd rows or so.
csvgrep -c 48 -m "A" **file_name** > **output_file_name**
I then double-checked my work to make sure I had the expected number of rows.
The number of rows matched up, but something had changed that would impact my later efforts which I’ll come back to.
Primary in my mind was figuring out a way to JOIN the three datasets together based on their common ID. This is the part where I googled for open source access and LibreOffice Base How-To and struggled to make it work as the application continued to crash when I tried to import the larger secondary file.
Creating django models and relating the data that way came to mind, and then it dawned me that I have mySQL and postgreSql installed in my local development environment and I might as well use them for what they’re supposed to be used for.
Some kind folks at Heroku have created a one-click installer for PostgreSQL on a Mac. I had gone through the Homebrew route, and while I eventually made it work, the one-click installer worked really well. Just had to open my .bashprofile and set the correct $PATH.
There is great PostgreSQL documentation that walks through the basic use, and I also installed pgAdmin III which is a GUI to explore the databases, which is good for a noob like me.
To get started creating a new database in my local PostgreSQL install I:
- Created a new user.
- Created a new database for the user.
- And then logged into the database I made using the PostgreSQL command line application.
- Along the way I learned some additional commands in the PostgreSQL command line application.
See if table has been created
DROP TABLE **name of table**
I then connected pgAdmin III to my local PostgreSQL instance by specifying the user and leaving the server address blank.
And then the fun began…
Using PostgreSQL like Access
So the goal of the project from the beginning was to take the three data sets and JOIN them together based on the common ID. Forgoing Access and using PostgreSQL meant I had to craft CREATE TABLE statements. Little did I know at the time, someone had already thought of this and developed a crafty solution.
But I’ll come back to that as well. To create the statements I grabbed the header rows from the three files, added data types and crafted a CREATE TABLE statement. So still in the the PostgreSQL command line application I added something like this following:
CREATE TABLE **table_name** (case_id integer, special_information_1 text, special_information_2 text, special_information_3 text);
pgAdmin III told me it wanted each table to have a primary key before I could edit them, so I spent some time figuring out how to do that on a table using something called a sequence.
ALTER TABLE **table_name** ADD COLUMN record_id serial PRIMARY KEY;
Then I wanted to upload my CSV files into the PostgreSQL Database table. I found a couple methods — I liked this one.
COPY **table_name** '**path_to_file/**file_name**' DELIMITERS ',' CSV;
Then I learned an important fact:
"If you don’t have access to the server, you can use psql’s \COPY command which is very similar to COPY FROM but works with local files. See the manual for details."
\COPY **table_name** '**path_to_file/**file_name**' DELIMITERS ',' CSV;
And then that didn’t work either. I later learned another important fact: that if your csv file has a header row, well you need to tell PostgreSQL that, hence this command:
\COPY **table_name** '**path_to_file/**file_name**' DELIMITERS ',' CSV HEADER;
That worked… because the CSV HEADER designation told PostgreSQL that the CSV file had a header row.
And then I learned that csvkit — wonder of wonders — can not only generate CREATE TABLE statements from a CSV, but it can freaking create a database and upload a CSV to a new table in that database with a simple command. Heck, it can take a whole folder of CSV files and create tables from them. Back to Mark Ng: “From the Future.”
Using csvkit with PostgreSQL (1)
So in learning that I can use several csvkit to do what I had done above, I set out to create the project again using only the utility. My notes are also more streamlined as I have a better idea of how to achieve my end result.
With PostgreSQL up and running, you can use csvkit to create a database.
From there, you can use csvkit’s csvsql command to generate a CREATE TABLE statement based on your CSV file.
csvsql -i postgresql **file_name**
Running the above command in the terminal left me with something like this that I can use in the PostgreSQL application:
CREATE TABLE "**name_of_table**" ( case_id INTEGER NOT NULL, special_information_1 VARCHAR(1) NOT NULL, special_information_2 VARCHAR(1) NOT NULL, special_information_3 VARCHAR(1) NOT NULL, other_associated_factor_1 VARCHAR(1) NOT NULL, other_associated_factor_2 VARCHAR(1) NOT NULL, );
But you can also use csvsql to create the database and import data from a CSV directly into PostgreSQL. Make sure your virtual environment has psycopg2 or MySQL-python installed depending on your choice of database. Use pip install psycopg2 or pip install MySQL-python
createdb *name_of_database* csvsql --db postgresql:///*name_of_database* --insert *name_of_file* csvsql --db postgresql:///*name_of_database* --insert *name_of_file* csvsql --db postgresql:///*name_of_database* --insert *name_of_file*
And best of all, you can use csvsql to create tables for an entire folder of CSV files and create PostgreSQL tables based on those files.
createdb *name_of_database* csvsql --db postgresql:///*name-of-database* --insert *path-to-file*/*name-of-file*.csv
Log into PostgreSQL application from the command line and add primary keys to the tables after they are created per pgAdmin III.
psql *name_of_database* ALTER TABLE *name_of_table1* ADD COLUMN record_id serial PRIMARY KEY; ALTER TABLE *name_of_table2* ADD COLUMN record_id serial PRIMARY KEY; ALTER TABLE *name_of_table3* ADD COLUMN record_id serial PRIMARY KEY;
Learn to create a join on two tables using case_id. Great visual guide is here.
SELECT * FROM *name_of_table1*; (4662 rows) SELECT * FROM *name_of_table2*; (95716 rows) SELECT * FROM *name_of_table3*; (257333 rows) SELECT name_of_table1.*, name_of_table2.* FROM name_of_table1, name_of_table2 WHERE name_of_table1.case_id = name_of_table2.case_id; (5906 rows)
Export to a new file, I’ll call it joined_new_file.csv.
Load file as new table
csvsql --db postgresql:///*name-of-database* --insert *joined_new_file.csv*
Test various JOIN methods to get proper matching of records. This case records in Table 1 could have several corresponding records in Table 2. So I went with a LEFT OUTER JOIN.
It was about this time that I learned that the data type of a column in my table had changed from to . I was able to trace the changce back to something that happened when I used csvstat.
I filtered the file to give me all the rows with a single value. In this case it was the letter “A”. When I created a new table based on that filter I got the . Some helpful advice from Adam Hooper on the NICAR listserv gave me the how’s and why’s.
In CSV, every value is a string—no exceptions. csvkit “intelligently” guesses at a column type by looking at every string value and seeing if they all follow the same pattern.
For instance, if every value looks like “12:23:31”, it’ll pick a time format.
If you have 200,000 rows, and one row has a value that isn’t a time format, csvkit will (correctly) call it a “string”. But if you remove the row that isn’t formatted like a time, then csvkit will (correctly) call the column “time” in the resulting CSV.
I suppose you’ve read this far and gone, “yes, but why would it think ‘A’ is a datetime?” And the answer is … (wait for it) … “A” is a time value! “A” is short for “AM”, so it gets parsed as 12:00 a.m. on the morning the program is executed. (The same trick doesn’t work for “P”, because of a bug in the “dateutil” library csvkit depends on.)
And since every value in column 48 is now ‘A’, every value is a datetime — meaning the entire column is a datetime column. (This wasn’t the case before, because there were non-“A” values in column 48, so csvkit didn’t interpret it as a datetime.)
And for my purposes I didn’t really need column 48 to be the correct data type. In this case, the data was all of the same record. But because it had changed on me I began to wonder what else might have changed on me.
Regardless, I figured I’d give it another go just to streamline things a bit.
Using csvkit with PostgreSQL (2)
- Run csvstat to make sure data types are intact.
- Use csvkit to create PostgreSQL database and tables from an entire folder of CSVs.
- Query working_accidents.csv for rows in which motor_vehicle_involved_with = non-collision and save as working_non_collisions.csv.
- Generate CREATE statement for working_non_collisions.csv to upload into PostgreSQL.
- Add data from csv into the working_non_collisions table. The CSV HEADER designation says its a CSV file with a header row.
- Query for the related records on working_non_collisions and working_party tables to get expected output.
- Run INNER JOIN to see of same result occurs.
- Export as accidents_non_collisions_and_parties.csv. Open in LibreOffice and delete the duplicate case_id column.
- Generate CREATE statement for accidents_non_collisions_and_parties.csv to upload into PostgreSQL. In this instance there was a duplicate column for case_id that happened when I ran the select statement. Also changed motor_vehicle_involved_with to VARCHAR(1) and oaf_violation_code VARCHAR(4).
- Add data from csv into the accidents_non_collisions_and_parties table. The CSV HEADER designation says its a CSV file with a header row.
- Query on connections between accident & parties and victims. But then I realized this ignores accidents in which there was no victim.
- Test various JOIN methods to see different results. I want all records from accidents_non_collisions_and_parties and any that match in working_victim, which I learned is a LEFT OUTER JOIN.
- Export as joined_accidents_non_collisions.csv. Open in LibreOffice and delete the duplicate case_id columns.
- Doublecheck to see that motor_vehicle_involved_with column hasn’t changed to date-time type.
- Generate CREATE statement for working_non_collisions.csv to upload into PostgreSQL.
- Run various queries from within pgAdmin III