Chris L Keller ...


Link: John Keefe on Designing a Course in Data Journalism

When asked what such a course might look like, I point folks to Brian Boyer’s theoretical and amazing Hacker Journalism 101. Journalism needs more journalists who can code, and this is a great way to get there. …

In addition to programming, here are some of my favorite topics for classes, readings or workshops:

  • Finding data for your stories
  • Finding stories in your data
  • How to tell one story well
  • All data is dirty … and what to do about that
  • Basic stats
  • Percentage points for journalists
  • Mapmaking made easy
  • Lying and truthing with easily-made maps
  • When maps shouldn’t be maps
  • Basic chartbuilding
  • Lying and truthing with charts and graphs
  • Did I mention programming?

It’s up to the individual to find what works for them. That said, what worked for me was first learning to code to visualize data, followed by learning to code to gather data.

Now I need to learn to code to analyze data while refining my skills in the first two areas, especially developing a better eye for visualization and a sense for when not to use the hammers I’ve acquired…


Using csvkit and PostgreSQL to mimic a bit of Microsoft Access

Over the last couple of weeks I had my first real chance to take a very large dataset and explore it using tools like csvkit and PostgreSQL.

Needless to say I learned a lot, and with that learning came a bit of backtracking. But something I have learned to do as a matter of course now is to document every step of a project — whether coding or writing. The extra effort not only comes in handy when creating a workflow, but offers plenty of advice and ideas for others who are also learning. What follows is an edited and revised version of the notes from my scratchpad file.

This dataset came in 15 text files — five main files and two sets of five text files related to the main set through a common ID. In total there were some 125,000 rows. I also received a “column key” describing the name of each field, the data type of each field and what the designations stood for.

Without access to Access, and after several attempts to use LibreOffice Base to create three tables in a database and JOIN them on the common ID — crash, crash, crash — I sucked it up and learned how to use csvkit and PostgreSQL. Boy am I glad I did. And boy am I thankful for Christopher Groskopf’s csvkit, his imagination about what is possible, and his ability to program it.

Thinking about it now reminds me of something Mark Ng said during #NICAR12 regarding Chris: “He’s from the future.”

As I mentioned, I attempted to document each step, query, failure, success and learning moment, so while this is very much a how-to, it is also long, winding and potentially confusing to some. That said, if you like journo-tech-nerdery — or want to offer some tips — please enjoy.


Getting started with some tools

csvkit is a command-line toolset to explore CSV files. It can be installed into a virtual environment which isolates it from key system packages:

sudo pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install csvkit

It can also be installed globally:

sudo pip install csvkit 

If you’re beginning on the command line, on the suggestion of Joe Germuska, this is a great resource to learn from.


Preparing the data for exploration

  1. After importing into Google Docs failed — the dataset was too large — I opened one of the main text files in LibreOffice to see what I was working with.
  2. The file came without headers so I pulled the first few rows into a new CSV file, imported that into a Google spreadsheet — easier to type in — and used the accompanying key to enter in column information.
  3. I then exported the header row as a new file, opened each of the data files in a text editor, pasted the header at the top of the file and saved as CSV. I repeated this step for all 15 files. (There’s a time savings to be found here!)
  4. I fired up csvkit and ran csvclean on each of the 15 files to look for syntax errors or bad rows.
  5. I tried to use csvstack to combine the files, but I ran into errors that I couldn’t decipher. So instead I manually combined the CSV files into one large CSV file with one row of headers. (More time savings to be found!)
  6. I dropped the three combined CSV files into my “data_inbox” and fired up a script based on one authored by Jeff Severns Guntzel that:
    • Makes a backup of the data files.

    • Creates a new project directory with a specified folder structure.

    • Audits the CSV using csvkit’s csvstat function and creates a data-abstract file.

    • Moves all of the files into the new project directory.

  7. Then I opened up my data-abstract-file to see what I had. Here’s an example of what I was able to see at a glance:

     47. type_of_collision
         <type 'unicode'>
         Nulls: False
         Unique values: 10
         5 most frequent values:
             B: 50090
             C: 33952
             E: 16700
             D: 13184
             F: 3913
         Max length: 1

In this case, B, C, E, D and F correspond to a specific value in the column key. Thanks to some magic from Chris and Jeff, at a quick glance I was able to decipher values in every column of the data file and pass information on to the editor for this story.
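The header-pasting and manual combining in steps 3 and 5 can also be scripted. Here is a minimal sketch using only Python’s csv module; the file names and sample contents are hypothetical stand-ins for the real 15 files:

```python
import csv
import os
import tempfile

# Build tiny sample inputs in a temp directory: a header-only CSV plus
# headerless data files (names and contents are invented for the demo).
tmp = tempfile.mkdtemp()
header_file = os.path.join(tmp, "header.csv")
with open(header_file, "w", newline="") as f:
    csv.writer(f).writerow(["case_id", "type_of_collision"])

data_files = []
for i, rows in enumerate([[["1", "B"]], [["2", "C"], ["3", "E"]]]):
    name = os.path.join(tmp, "part%d.txt" % i)
    with open(name, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    data_files.append(name)

def stack_files(header_file, data_files, out_file):
    """Write the header row once, then append every row of each headerless file."""
    with open(out_file, "w", newline="") as out:
        writer = csv.writer(out)
        with open(header_file, newline="") as f:
            writer.writerow(next(csv.reader(f)))
        for name in data_files:
            with open(name, newline="") as f:
                writer.writerows(csv.reader(f))

combined = os.path.join(tmp, "combined.csv")
stack_files(header_file, data_files, combined)
```

One loop instead of 15 rounds of copy-and-paste, and the same script would have handled the later three-file combination as well.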

You can use a csvkit function to do this on an individual column as well, substituting the column_number and file_name in the example command below.

csvstat -c **column_number** --freq **file_name** 
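If csvkit isn’t handy, the same per-column frequency count is a few lines of plain Python. The demo file and column name below are stand-ins:

```python
import csv
import tempfile
from collections import Counter

def column_frequencies(file_name, column_name, top_n=5):
    """Count the values in one column of a headered CSV,
    roughly what `csvstat -c column --freq` reports."""
    with open(file_name, newline="") as f:
        counts = Counter(row[column_name] for row in csv.DictReader(f))
    return counts.most_common(top_n)

# Tiny demo file; the real column was type_of_collision in the crash data.
demo = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
demo.write("type_of_collision\nB\nB\nB\nC\nC\nE\n")
demo.close()

freq = column_frequencies(demo.name, "type_of_collision", 2)
# freq == [('B', 3), ('C', 2)]
```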

Making the deep dive

Once I learned we wanted to focus on a specific type of record, I used csvkit’s csvgrep (search) function to search the rows for that value and output the results to a new file of about 4,000 rows.

csvgrep -c 48 -m "A" **file_name** > **output_file_name** 

I then double-checked my work to make sure I had the expected number of rows.

csvstat **output_file_name** 

The number of rows matched up, but something had changed that would impact my later efforts, which I’ll come back to.

Foremost in my mind was figuring out a way to JOIN the three datasets together based on their common ID. This is the part where I googled for “open source Access” and “LibreOffice Base how-to” and struggled to make it work, as the application continued to crash when I tried to import the larger secondary file.

Creating Django models and relating the data that way came to mind, and then it dawned on me that I have MySQL and PostgreSQL installed in my local development environment and might as well use them for what they’re meant for.


Tools revisited

Some kind folks at Heroku have created a one-click installer for PostgreSQL on a Mac. I had gone the Homebrew route, and while I eventually made it work, the one-click installer worked really well. I just had to open my .bash_profile and set the correct $PATH.

There is great PostgreSQL documentation that walks through basic use, and I also installed pgAdmin III, a GUI for exploring databases, which is good for a noob like me.

To get started creating a new database in my local PostgreSQL install I:

  1. Created a new user.
  2. Created a new database for the user.
  3. And then logged into the database I made using the PostgreSQL command line application.
  4. Along the way I learned some additional commands in the PostgreSQL command line application.

Activate PostgreSQL

 psql 

List databases

 \list 

See if table has been created

 \dt 

Drop table

 DROP TABLE **name_of_table**; 

Quit psql

 \q 

I then connected pgAdmin III to my local PostgreSQL instance by specifying the user and leaving the server address blank.

And then the fun began…


Using PostgreSQL like Access

So the goal of the project from the beginning was to take the three data sets and JOIN them together based on the common ID. Forgoing Access and using PostgreSQL meant I had to craft CREATE TABLE statements. Little did I know at the time, someone had already thought of this and developed a crafty solution.

But I’ll come back to that as well. To create the statements I grabbed the header rows from the three files, added data types and crafted a CREATE TABLE statement. So, still in the PostgreSQL command line application, I added something like the following:

CREATE TABLE **table_name** (
    case_id integer,
    special_information_1 text,
    special_information_2 text,
    special_information_3 text
);

pgAdmin III told me it wanted each table to have a primary key before I could edit them, so I spent some time figuring out how to do that on a table using something called a sequence.

ALTER TABLE **table_name** ADD COLUMN record_id serial PRIMARY KEY; 

Then I wanted to upload my CSV files into the PostgreSQL database table. I found a couple of methods — I liked this one.

COPY **table_name** FROM '**path_to_file/file_name**' DELIMITER ',' CSV; 

Then I learned an important fact:

"If you don’t have access to the server, you can use psql’s \COPY command which is very similar to COPY FROM but works with local files. See the manual for details."

\COPY **table_name** FROM '**path_to_file/file_name**' DELIMITER ',' CSV; 

And then that didn’t work either. I later learned another important fact: if your CSV file has a header row, you need to tell PostgreSQL that, hence this command:

\COPY **table_name** FROM '**path_to_file/file_name**' DELIMITER ',' CSV HEADER; 

That worked… because the CSV HEADER designation told PostgreSQL that the CSV file had a header row.

And then I learned that csvkit — wonder of wonders — can not only generate CREATE TABLE statements from a CSV, but it can freaking create a database and upload a CSV to a new table in that database with a simple command. Heck, it can take a whole folder of CSV files and create tables from them. Back to Mark Ng: “From the Future.”


Using csvkit with PostgreSQL (1)

Having learned that I could use several csvkit utilities to do what I had done above, I set out to create the project again using only the toolkit. My notes here are also more streamlined, as I had a better idea of how to achieve my end result.

With PostgreSQL up and running, you can create a database using PostgreSQL’s createdb utility.

createdb **name_of_database** 

From there, you can use csvkit’s csvsql command to generate a CREATE TABLE statement based on your CSV file.

csvsql -i postgresql **file_name** 

Running the above command in the terminal left me with something like this that I can use in the PostgreSQL application:

CREATE TABLE "**name_of_table**" (
    case_id INTEGER NOT NULL,
    special_information_1 VARCHAR(1) NOT NULL,
    special_information_2 VARCHAR(1) NOT NULL,
    special_information_3 VARCHAR(1) NOT NULL,
    other_associated_factor_1 VARCHAR(1) NOT NULL,
    other_associated_factor_2 VARCHAR(1) NOT NULL
);

But you can also use csvsql to create the tables and import data from a CSV directly into PostgreSQL. Make sure your virtual environment has psycopg2 or MySQL-python installed, depending on your choice of database: pip install psycopg2 or pip install MySQL-python.

createdb *name_of_database*
csvsql --db postgresql:///*name_of_database* --insert *name_of_file*
csvsql --db postgresql:///*name_of_database* --insert *name_of_file*
csvsql --db postgresql:///*name_of_database* --insert *name_of_file*

And best of all, you can point csvsql at an entire folder of CSV files and have it create PostgreSQL tables based on those files.

createdb *name_of_database*
csvsql --db postgresql:///*name-of-database* --insert *path-to-file*/*name-of-file*.csv

Log into the PostgreSQL command line application and add primary keys to the tables after they are created, as pgAdmin III requires.

psql *name_of_database*
ALTER TABLE *name_of_table1* ADD COLUMN record_id serial PRIMARY KEY;
ALTER TABLE *name_of_table2* ADD COLUMN record_id serial PRIMARY KEY;
ALTER TABLE *name_of_table3* ADD COLUMN record_id serial PRIMARY KEY;

Learn to create a join on two tables using case_id. A great visual guide is here.

SELECT * FROM *name_of_table1*;
(4662 rows)
SELECT * FROM *name_of_table2*;
(95716 rows)
SELECT * FROM *name_of_table3*;
(257333 rows)
SELECT name_of_table1.*, name_of_table2.*
FROM name_of_table1, name_of_table2
WHERE name_of_table1.case_id = name_of_table2.case_id;
(5906 rows)

Export to a new file, I’ll call it joined_new_file.csv.

Load file as new table

csvsql --db postgresql:///*name-of-database* --insert *joined_new_file.csv* 

Test various JOIN methods to get proper matching of records. In this case, records in Table 1 could have several corresponding records in Table 2, so I went with a LEFT OUTER JOIN.
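The difference between join types matters here: an INNER JOIN drops rows from the first table that have no match in the second, while a LEFT OUTER JOIN keeps them. A sketch of that behavior, with Python’s sqlite3 standing in for PostgreSQL and simplified stand-in table names:

```python
import sqlite3

# SQLite standing in for PostgreSQL; tables and columns are
# simplified stand-ins for the real crash tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE accidents (case_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE victims (case_id INTEGER, victim_age INTEGER)")
cur.executemany("INSERT INTO accidents VALUES (?)", [(1,), (2,), (3,)])
# Case 3 has no victim record.
cur.executemany("INSERT INTO victims VALUES (?, ?)", [(1, 34), (2, 51), (2, 19)])

# INNER JOIN drops case 3 entirely.
inner = cur.execute(
    "SELECT a.case_id, v.victim_age FROM accidents a "
    "JOIN victims v ON a.case_id = v.case_id").fetchall()

# LEFT OUTER JOIN keeps case 3, with NULL for the missing victim columns.
left = cur.execute(
    "SELECT a.case_id, v.victim_age FROM accidents a "
    "LEFT OUTER JOIN victims v ON a.case_id = v.case_id").fetchall()
```

Here `inner` has three rows and `left` has four, because the accident with no victim survives the outer join.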

It was about this time that I learned that the data type of a column in my table had changed from a string type to a datetime. I was able to trace the change back to something that happened when I used csvstat.

I filtered the file to give me all the rows with a single value, in this case the letter “A”. When I created a new table based on that filter I got the datetime type. Some helpful advice from Adam Hooper on the NICAR listserv gave me the hows and whys.

In CSV, every value is a string—no exceptions. csvkit “intelligently” guesses at a column type by looking at every string value and seeing if they all follow the same pattern.

For instance, if every value looks like “12:23:31”, it’ll pick a time format.

If you have 200,000 rows, and one row has a value that isn’t a time format, csvkit will (correctly) call it a “string”. But if you remove the row that isn’t formatted like a time, then csvkit will (correctly) call the column “time” in the resulting CSV.

I suppose you’ve read this far and gone, “yes, but why would it think ‘A’ is a datetime?” And the answer is … (wait for it) … “A” is a time value! “A” is short for “AM”, so it gets parsed as 12:00 a.m. on the morning the program is executed. (The same trick doesn’t work for “P”, because of a bug in the “dateutil” library csvkit depends on.)

And since every value in column 48 is now ‘A’, every value is a datetime — meaning the entire column is a datetime column. (This wasn’t the case before, because there were non-“A” values in column 48, so csvkit didn’t interpret it as a datetime.)

And for my purposes I didn’t really need column 48 to have the correct data type; at this point every row held the same value anyway. But because it had changed on me, I began to wonder what else might have changed.
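csvkit’s guessing can be mimicked in a few lines: try to parse every value under a candidate format, and fall back to string on the first holdout. A rough sketch of the idea (not csvkit’s actual code):

```python
from datetime import datetime

def infer_column_type(values):
    """Guess a column's type the way csvkit roughly does: if every value
    parses under a candidate format, use it; one holdout means string."""
    candidates = [("%H:%M:%S", "time"), ("%Y-%m-%d", "date")]
    for fmt, name in candidates:
        try:
            for v in values:
                datetime.strptime(v, fmt)
            return name
        except ValueError:
            continue
    return "string"

# One non-time value is enough to flip the whole column back to string:
# infer_column_type(["12:23:31", "04:05:06"]) evaluates to "time"
# infer_column_type(["12:23:31", "A"]) evaluates to "string"
```

This is why filtering down to all-“A” rows changed the column’s type: removing the holdouts changed what the inference saw.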

Regardless, I figured I’d give it another go just to streamline things a bit.


Using csvkit with PostgreSQL (2)

  1. Run csvstat to make sure data types are intact.
  2. Use csvkit to create PostgreSQL database and tables from an entire folder of CSVs.
  3. Query working_accidents.csv for rows in which motor_vehicle_involved_with = non-collision and save as working_non_collisions.csv.
  4. Generate CREATE statement for working_non_collisions.csv to upload into PostgreSQL.
  5. Add data from the CSV into the working_non_collisions table. The CSV HEADER designation says it’s a CSV file with a header row.
  6. Query for the related records on working_non_collisions and working_party tables to get expected output.
  7. Run an INNER JOIN to see if the same result occurs.
  8. Export as accidents_non_collisions_and_parties.csv. Open in LibreOffice and delete the duplicate case_id column.
  9. Generate CREATE statement for accidents_non_collisions_and_parties.csv to upload into PostgreSQL. In this instance there was a duplicate column for case_id that happened when I ran the select statement. Also changed motor_vehicle_involved_with to VARCHAR(1) and oaf_violation_code VARCHAR(4).
  10. Add data from the CSV into the accidents_non_collisions_and_parties table. The CSV HEADER designation says it’s a CSV file with a header row.
  11. Query on connections between accident & parties and victims. But then I realized this ignores accidents in which there was no victim.
  12. Test various JOIN methods to see different results. I want all records from accidents_non_collisions_and_parties and any that match in working_victim, which I learned is a LEFT OUTER JOIN.
  13. Export as joined_accidents_non_collisions.csv. Open in LibreOffice and delete the duplicate case_id columns.
  14. Double-check that the motor_vehicle_involved_with column hasn’t changed to a date-time type.
  15. Generate CREATE statement for working_non_collisions.csv to upload into PostgreSQL.
  16. Run various queries from within pgAdmin III

Introducing MadCrime, a Django-based scraper of incident reports released by the Madison Police Department

tl;dr - I can finally say I made a Django application, and it’s live on the web and available on GitHub. For now I’ll call it MadCrime, and it scrapes Madison Police Department incident reports and displays the content to the user.

It’s been more than a year in the making, but I can finally say I built and deployed a Django project.

The project itself isn’t overly complicated, but it has a high ceiling and most of all has given me some confidence in my practice habits and ability to retain concepts.

For now I’m calling it MadCrime, and its purpose is to scrape Madison Police Department incident reports, store them in a database and display the content to the user with a map of the incident. It also offers access to the JSON of the incidents using Tastypie.

There are limitations, some from the data, more from my abilities. For instance, suspect, arrest and victim details are entered into one field, and I would need to either separate them out manually or figure out how to parse the details into separate fields. More importantly, this table is not the comprehensive list of Madison Police Department incident reports, but rather a snapshot that is released to the public.
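Separating those details could start with a labeled-section regex, though this is purely a sketch: the labels and sample text below are invented, since the real field is free text whose format varies.

```python
import re

# Hypothetical labels; the real incident-report field is unstructured text.
LABELS = ["Suspect", "Arrested", "Victim"]
_ALT = "|".join(LABELS)
PATTERN = re.compile(
    r"(?P<label>{0}):\s*(?P<detail>.*?)(?=(?:{0}):|\Z)".format(_ALT),
    re.S,
)

def split_details(blob):
    """Break a combined details field into one value per recognized label."""
    return {m.group("label"): m.group("detail").strip()
            for m in PATTERN.finditer(blob)}

# split_details("Suspect: male, 30s Victim: store clerk")
# -> {'Suspect': 'male, 30s', 'Victim': 'store clerk'}
```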

But with the possibility of the open data ordinance being passed and implemented, that might change, so in some ways the life span of MadCrime as it stands is likely limited. Which is why it was important for me to “get this out the door” somewhat quickly and call it version 1, as opposed to sitting on it while learning how to implement search or make the main display sortable.

The code is on GitHub, and I have a nice long list of potential improvements, including adding search functionality and separating out information like suspect information that is contained in one singular field in the incident report.

The idea to scrape police incident reports and map them came from CityCampMadison last month, but the follow-through and execution have been informed and inspired by so many different projects and thought-provoking discussions, many of which I’ve mentioned in previous posts.

This little program won’t write the story itself. But had the data been available, I could have used a tool like this to create shells for briefs that could be edited down, freeing me to focus on larger stories.

And with some simple search capabilities, maybe even find some similarities between crimes and suspects.

"Think about what this kind of thing means. It means that we cannot assume that the campaign visible to the mass audience is the same campaign that’s being pitched to individuals and groups around the nation, and that winning coalitions can be built not just by harnessing large groups (unions, religious voters, etc.) but also by piecing them together in small units"

What We Don’t Know About Elections :: The Scoop


Fantastic thoughts to live — and work — for

  • Working longer is not working harder. Go home.
  • 40 hours includes the commute.
  • Resist the urge to call a meeting.
  • Consensus is bullshit. Do what’s best for the company.
  • We will not ship shit.
  • If it’s hard to use, it hurts the product, the company, and our reputation.
  • You will do customer service for the products you create.
  • We will work in public. Our code, our designs, and our ideas are not our business. We are our business. Our products are our business. Customers are our business.
  • People are expensive. We will strive to keep individuals working for the customer, and not on the business.
  • Efficiency is everything. We will remove obstructions.
  • We will not celebrate. Every day will be awesome, or we stop what we’re doing and make it awesome. Cake is for gazelles, and we are lions.
  • We will enjoy our work, our products, and our lives. We will make work awesome.
  • We will be lean. We will NOT ignore problems, and we will swarm when issues arise.
  • You are an adult. You will manage your own machine and tools.

Via stefankendall.com: Core Values.