Chris L Keller ...

March 2012

14 posts

If working toward a Demo, & don't want to write a Memo, could really simple slides work?

Demos not Memos was used by Matt Waite in a 2009 blog post outlining how he built PolitiFact, which won a Pulitzer Prize. And it was something drilled — lovingly — into my head during NICAR 12 last week in St. Louis. 

I’ve always been a memo guy and recently graduated to bullet points … largely because my skills hadn’t caught up to my ambition. The last year has gotten better, as I’ve been able to churn out some data maps. I’m hoping to have turned the corner with regard to programming as well, though there is a lot of practice still ahead. But for the group I’m trying to convince, these slides might do the trick…?

Feb 29, 2012
#web development

February 2012

8 posts

Fresh from #nicar12, here are curated notes to set up a Windows 7 Python environment so I can practice at work

The following are bullet points gathered from walkthroughs created by Anthony DeBarros and from Kenneth Reitz’s Python Guide to get a Python development environment up and running on my work machine.

I needed admin rights for only two steps:

  • Python 2.7.2 installation (Don’t install 3.1)
  • Setting the Python path

Once those steps are completed, these steps will get easy_install, pip, virtualenv, django and many other packages up and running.

Download Distribute for Windows and save the script to your Python27 directory.

Open your command prompt by clicking the Start Menu and searching for cmd to find cmd.exe. You will change into the Python27 directory and run the distribute_setup.py script. In turn, you will be able to install pip, which is really the gateway drug to adding packages and learning your way.

cd C:\
cd C:\Python27
python distribute_setup.py
easy_install pip
pip install virtualenv

My virtual environments exist inside of C:\Python27\Scripts, so to get there, change to that directory and create a new virtual environment.

cd Scripts
virtualenv --distribute <environment name>

To activate the virtual environment:

cd <environment name>
cd Scripts
activate.bat

From here, you can use pip to install Python packages to the virtual environment:

pip install django
pip install csvkit
pip install BeautifulSoup
pip install mechanize

To deactivate a virtual environment:

deactivate
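
For readers on a newer Python, the create step above can also be scripted. This is a minimal sketch that assumes Python 3 and uses the standard library's venv module as a stand-in for virtualenv, so it is not the method the walkthrough itself uses. The helper make_env and the name practice_env are made up for illustration.

```python
import os
import tempfile
import venv


def make_env(parent_dir, name):
    """Create a virtual environment called `name` inside `parent_dir`."""
    env_dir = os.path.join(parent_dir, name)
    # with_pip=False keeps the sketch quick and offline; pass True
    # if you want pip available inside the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


# practice_env is a hypothetical name; a temp folder keeps things tidy.
env_path = make_env(tempfile.mkdtemp(), "practice_env")
print(env_path)
```

On Windows the new folder should contain a Scripts directory holding activate.bat, which is the same file the activation step above runs.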

My next goal is to get a version of virtualenvwrapper running, which makes managing virtual environments a piece of cake.

Feb 29, 2012
#nicar #python #web development #Learning Library
Slides from #nicar12 Web Scraping without Programming panel

I had the pleasure of co-presenting at NICAR 2012 in St. Louis with Michelle Minkoff from the Associated Press.

Our panel focused on easy ways for a journalist to start down the path of scraping data from webpages, using common spreadsheet formulas, browser extensions and plugins and no-cost tools.

Below, find our slides for the panel. The tipsheet is here. A companion blog post detailing common HTML tags that a beginner will encounter can be found here. Michelle’s walkthrough of scraping information using the Scraper Chrome extension and the OutwitHub Firefox plugin can be found here.

Feb 26, 2012 (2 notes)
#nicar #web scraping #Learning Library
A #nicar12 walkthrough -- HTML tags from 10,000 feet

Many things go into making a webpage pop up in a browser when you type in a URL or click a bookmark. And many things make up that page (metadata, styles, scripts and markup) to make it do all kinds of neat and interesting things.

But when it comes time to scrape content from a webpage, it’s the markup we’re interested in. In the case of this blog post, we’re looking at HTML and a couple basic tags — or elements — that give structure to a webpage.

For the purposes of this walkthrough, an element is a section of the webpage set off by HTML tags.

Most HTML tags come in pairs. One tag, the opening tag, tells your browser that a new section of the webpage is beginning, and it will have to decide how to display it. The second tag, the closing tag, tells the browser it’s closing that element. (A few tags, such as <img>, stand alone and have no closing tag.)

There are many different types of tags, and each is charged with displaying content in different ways.

We’ll start with <div>’s and <table>’s, the latter of which dominated the early days of the World Wide Web, and now are largely used only when the need arises to hold fielded data.

And when we learn to identify these basic containers and how they fit together, we begin learning how we can access the content they hold and pull off a basic scrape of content from a webpage.

Learning to identify how basic containers fit together

In many respects, we journalist types are tailor-made to learn how a webpage fits together. We’ve all read through a city budget line by line, or asked questions to learn the definitions of jargon. Learning HTML tags is no different.

Remember when you started out as a reporter? You likely didn’t know whom to call about a particular issue, or who the sources on the beat were. You accumulated experience and knowledge, and along the way found tools.

When it comes to what we’re about to learn, there is a tool that is indispensable: Web Inspector.

I’ll give you some basics to web inspector, or you can Google “web inspector”, or find Dan Nguyen’s tip sheet from NICAR12.

The easiest method is to — hopefully using the latest version of Firefox, Chrome or Safari — highlight something on a webpage, right-click on it and select “Inspect Element.”

So we’ll go to a page that lists Abbotsford child day care facilities and right-click on any element on the page. Doing so will bring up a console with all kinds of code and options and tabs that can look all kinds of scary at first.

To begin, we’re concerned with the main window, which contains the code that creates the webpage you are looking at.

Let’s all get on the same page by selecting the first link on the page, A-Zee’s Childcare, and right-clicking it.

You will see that A-Zee’s Childcare is contained within an anchor tag, the kind of HTML that allows the World Wide Web to function by linking pages together.

Often when trying to scrape webpages, the thing we highlight and inspect is the content we want, but our efforts can best be served by working our way out to larger content containers. So as we move out farther, this is what we see:

<tr>
    <td valign="top">
        <img src="/Clients/FHA/FHA_Website.nsf/linksquare.gif" alt="">
        A-Zee's Childcare
    </td>
    <td valign="top" nowrap="">&nbsp;(604) 850-6465</td>
    <td valign="top">&nbsp;&nbsp;Lori Brown</td>
</tr>

And as we’re about to learn, when we see <tr> and <td> on a webpage we can be sure that a <table> is close by.

<table>’s

If you are familiar with a spreadsheet (a series of columns and rows and cells), that’s basically all an HTML table is. And using tags, it delivers a series of rows and cells to a webpage, making it really easy to spot on a page. When looking for content that I want to scrape or pull into a spreadsheet, if I see a table, I begin to salivate.

First of all, an HTML table looks like a spreadsheet. It has rows that are created from top to bottom, and cells that are created from left to right.

It begins with the <table> tag, which really is nothing more than a container to house the rows and cells.

Within the <table> tag, rows begin with a <tr> tag and end with </tr>.

Within the row (<tr></tr>) you’ll find the cells that hold content. These begin with a <td> tag and end with </td>.

Here’s a sample table:

<table>
<tr>
    <td>First Row, First Cell</td>
    <td>First Row, Second Cell</td>
</tr>
<tr>
    <td>Second Row, First Cell</td>
    <td>Second Row, Second Cell</td>
</tr>
</table>
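
To see how that structure maps onto spreadsheet-style rows, here is a minimal sketch that reads the sample table with Python 3's built-in html.parser module. The class name TableRows is made up for this example, and a dedicated scraping library may be a better fit for real pages.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per row
        self.in_cell = False  # True while we are inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new row begins
        elif tag == "td":
            if not self.rows:         # tolerate a <td> with no <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")  # a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


sample = """
<table>
<tr>
    <td>First Row, First Cell</td>
    <td>First Row, Second Cell</td>
</tr>
<tr>
    <td>Second Row, First Cell</td>
    <td>Second Row, Second Cell</td>
</tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

Each inner list is one <tr>, and each string inside it is one <td>, which is the same shape a spreadsheet row would have.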

Each of these tags can have its own attributes, or instructions for the web browser that tell it how to act when it encounters one.

For instance you might see:

<table >

which indicates, because an ID is unique on a page, that it’s a unique instance of <table>. When it comes to scraping, I like IDs because they allow me to target an element exactly.

You might also see a class. A class in HTML is a more general attribute, one that can be used over and over, and on different kinds of tags.

Whereas an ID should only appear once, a class may appear over and over again. So on a web page, you might see:

<table class="purple">

<table class="description">

<table class="author_bio">

IDs and classes aren’t unique to <table>’s. In fact, they can be applied to any HTML tag.

<div class="description">

<ul >

<p class="author_bio" href="http://mysite.com">

To repeat, any webpage worth its design will have one instance of a particular ID, but it can have several instances of a class.
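
The difference between IDs and classes is easy to see by counting them. The following is a small sketch, assuming Python 3 and its built-in html.parser module; the markup is invented for the example (including the id "facilities"), and AttributeFinder is a made-up helper name.

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Record every tag that carries an id or a class attribute."""

    def __init__(self):
        super().__init__()
        self.ids = {}      # id value -> tag name
        self.classes = {}  # class name -> list of tag names

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids[attrs["id"]] = tag
        # a class attribute may hold several space-separated names
        for name in (attrs.get("class") or "").split():
            self.classes.setdefault(name, []).append(tag)


# made-up markup; none of it comes from a real page
page = """
<div class="description">
  <table id="facilities" class="description"></table>
  <p class="author_bio">A short bio</p>
  <p class="author_bio">Another bio</p>
</div>
"""

finder = AttributeFinder()
finder.feed(page)
print(finder.ids)      # one tag per id
print(finder.classes)  # possibly many tags per class
```

The id points at exactly one tag, while each class name collects a list of tags, which is why an ID makes the more precise target for a scrape.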

<ul> and <p> tags are also HTML elements, and can be targeted for a scrape, but generally are some of the smaller elements you will target.

But the <div>. That’s a good one.

Wonderfully general piece of advice: When learning to scrape, it’s nice to start big and whittle down.

<div>’s

When I started to learn HTML <div>’s were hard for me to understand. I can’t remember exactly what made the concept click.

My gut tells me it took practice, practice and more practice to finally understand how these tags fit together and how they can be manipulated.

Think about it in terms of learning to play the guitar. You learn two or three chords, strum a bit, make little ditties, repeat.

And I like to think of <div>’s as being like Russian nesting dolls. Within each you will find a smaller and smaller instance that holds content.

<div>’s are the ultimate multi-purpose container, and can be used to hold tables, lists, paragraphs — OK anything — on a webpage. And by targeting their IDs and classes, we can drill down and capture the content.
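
One way to picture the nesting dolls is to print each <div> indented by how deep it sits. This is a rough sketch, assuming Python 3 and its built-in html.parser module; the page markup and the DivOutline class are invented for illustration.

```python
from html.parser import HTMLParser


class DivOutline(HTMLParser):
    """List each <div> with how deeply it is nested inside other <div>s."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # (depth, label) pairs in page order

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            attrs = dict(attrs)
            label = attrs.get("id") or attrs.get("class") or "(no id or class)"
            self.outline.append((self.depth, label))
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


# made-up markup: one outer container, one inner, two innermost
page = """
<div id="page">
  <div class="content">
    <div class="post"><p>Hello</p></div>
    <div class="post"><p>Again</p></div>
  </div>
</div>
"""

outline_parser = DivOutline()
outline_parser.feed(page)
for depth, label in outline_parser.outline:
    print("  " * depth + label)
```

Reading the outline from the outside in is the "start big and whittle down" approach: find the outer container first, then the smaller one that actually holds the content.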

lists

On a webpage, lists come in a couple varieties: unordered, designated by a <ul> tag, and ordered, designated by an <ol> tag. Both contain a series of items that are contained in <li> tags or “list item” tags.

In my experience the most common variety is the unordered kind.

Like all HTML elements, you can assign IDs and classes to lists and their items (both <ul> tags and <li> tags), which allows us to select them when inspecting a page we want to scrape.
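
As an illustration of selecting a list by its ID, here is a short sketch, assuming Python 3 and its built-in html.parser module. The id "tools", the class "steps" and the ListItems helper are all invented for the example.

```python
from html.parser import HTMLParser


class ListItems(HTMLParser):
    """Collect the <li> text from the one list whose id matches `target_id`."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.list_depth = 0   # greater than 0 while inside the targeted list
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            if self.list_depth or dict(attrs).get("id") == self.target_id:
                self.list_depth += 1
        elif tag == "li" and self.list_depth:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth:
            self.list_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


# made-up markup: only the first list carries the id we ask for
page = """
<ul id="tools">
  <li>Scraper</li>
  <li>OutwitHub</li>
</ul>
<ol class="steps">
  <li>Inspect</li>
</ol>
"""

list_parser = ListItems("tools")
list_parser.feed(page)
print(list_parser.items)
```

The ordered list is skipped entirely because it has no matching ID, so only the items from the targeted <ul> come back.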

Feb 24, 2012 (4 notes)
#html tags #nicar #Learning Library
How a beginner used Python to interact with the Open States API

Updated with some corrections/clarifications from Paul Tagliamonte, who hacked on the newly-released python-sunlight library and blogs here.

Updated with new code snippet to use terminal input to search for campaign contributions to legislators and write to a csv.

A little less than a year ago I tried to play with the Open States legislator API, and with a bunch of help came away with a simple php-based search of our legislators.

If you haven’t played with the Open States API before, you should sign up for an API key and start. There is a wealth of information, 47 states’ worth, just waiting to be unlocked by your imagination.

If you have at least a passing interest in python, here are some steps that a beginner can use to grab legislator data and write it to a csv. I’ll follow up later with a method of sending that data to Google’s Fusion Tables, where you could integrate it into a map.

So where to start?

On Monday, Sunlight Labs (part of the Sunlight Foundation, which created Open States) announced an updated python library to interact with its Congress, Open States and Capitol Words APIs.

You will need an API key to use it, but it’s just a matter of registering and waiting for the email.

Once you have the API key, Sunlight Labs has a couple easy steps to get set up. This is straight from their blog post:

If you’re on a UNIX-type (MacOS, GNU/Linux, *BSD, AIX or Solaris (or any of the other POSIX-ey systems)) machine, you should be able to run a command that looks like the following:

echo "your-api-key-here" > ~/.sunlight.key

It’s worth mentioning that your-api-key-here should actually be your API key that was emailed to you up above.

Via Paul, here’s what is happening at this step:

…this actually creates a “dot” (really: hidden on UNIX systems, so it doesn’t cloud up your folder) file in your home directory

The tilde (~) expands to where your home folder is — so on my system, it’d be something like

/home/tag/

So this actually writes a file to

/home/tag/.sunlight.key

so that each user on the system can have their own key.

The lib (during runtime) will look to see if this one file is present, and use that if it is. You can see if it’s there (as well as tons of other dot-files) by running something like:

ls -la ~

Basically, and correct me if I am wrong here, you want to make a “dot” file called .sunlight.key and place it in your home directory, where the library will look for it.
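
To make the tilde business a little more concrete, here is a small sketch of locating and reading a key file like the one Paul describes. It assumes Python 3, the helper names key_path and read_key are made up (they are not part of the sunlight library), and it uses a temporary folder as a stand-in for the home directory so nothing real gets overwritten.

```python
import os
import tempfile


def key_path(home=None):
    """Return where the key file lives: <home>/.sunlight.key"""
    home = home or os.path.expanduser("~")  # ~ expands to your home folder
    return os.path.join(home, ".sunlight.key")


def read_key(path):
    """Return the key stored at `path`, or None if the file is missing."""
    if not os.path.isfile(path):
        return None
    with open(path) as handle:
        return handle.read().strip()


# A temp folder plays the role of the home directory for this demo.
demo_home = tempfile.mkdtemp()
with open(key_path(demo_home), "w") as handle:
    handle.write("your-api-key-here\n")

print(key_path())                     # where the real file would live
print(read_key(key_path(demo_home)))  # the placeholder key written above
```

Calling key_path() with no argument shows the real location on your machine, which is handy for checking that the echo command wrote the file where you expected.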

Then using the python package installer pip, bring the library to your system.

pip install sunlight

However, pip doesn’t come with a default Python installation, so if you receive an error when you type in the command above, you can install setuptools first, which will give you access to the easy_install package manager.

Then it’s a matter of using easy_install to grab pip, and pip to grab sunlight.

easy_install pip
pip install sunlight

Now whenever a walkthrough suggests placing a file that begins with a dot in a directory that has ~ stand in for it, I get a little nervous; I’m not gonna lie. This one was fairly painless.

The documentation and GitHub repo have some sample code (a lot of sample code), and I repurposed a portion for this little snippet. You can save it as legi-return.py, and it will return all the legislators for Wisconsin.

#imports the library
import sunlight

#query for wisco legislators
legis =  sunlight.openstates.legislators(
    state='wi'
)

#identify each legislator in the output
for legi in legis:

    #return each legislator
    print "%s %s (%s) District: %s Party: %s ID: %s" % (
        legi['first_name'], legi['last_name'], legi['chamber'], legi['district'], legi['party'], legi['leg_id'])

Just cd into the directory and type python legi-return.py into the terminal and you should see some output.

If you get an error, it might be of the indentation variety, so I’ve also made a gist of the code.

Once I got this up and running, and being a python beginner curious of my own abilities, I started to wonder if I could create a search of legislators from the terminal. So I made this code snippet to search for Wisconsin state legislators by last name. You can call this legi-lookup.py if you want to.

#import library
import sunlight

#ask for input
legi_name = raw_input("Enter the Legislator's Last Name...")

#pull wisco legis with last name entered by user
legis =  sunlight.openstates.legislators(
    state='wi',
    last_name=legi_name
)

#tell the user what you found
print "We found these legislators" 
for legi in legis:
    print "%s %s (%s) District: %s Party: %s ID: %s" % (
        legi['first_name'], legi['last_name'], legi['chamber'], legi['district'], legi['party'], legi['leg_id'])

As you can see, this borrows quite heavily from the first snippet. The difference is I’m asking for input from the terminal — use Fitzgerald — and then narrowing the results based on the input.

Also, this area here…

legis =  sunlight.openstates.legislators(
    state='wi',
    last_name=legi_name
)

…can take all kinds of parameters to narrow your query. There’s a list here you can use.

So I’m feeling all good about things, and now I want to take my initial query, for all of the state legislators, and write the results to a csv. I had this from a prior walkthrough…

import csv

#opens the csv writer, specifies the file name
writer = csv.writer(open('stocks.csv', 'wb', buffering=0))

#data that should be written, separated by comma
writer.writerows([
    ('GOOG', 'Google, Inc.', 505.24, 0.47, 0.09),
    ('YHOO', 'Yahoo! Inc.', 27.38, 0.33, 1.22),
    ('CNET', 'CNET Networks, Inc.', 8.62, -0.13, -1.49)
])

…so I figured the principles were the same.

It actually wasn’t that difficult to figure out since I was already identifying the specific fields that I wanted to be output. You can save this snippet as legi-to-csv.py.

#import libraries
import sunlight
import csv

#pull wisco legis
legis =  sunlight.openstates.legislators(
    state='wi'
)

#open csv writer
writer = csv.writer(open('legi.csv', 'wb', buffering=0), delimiter=';', quoting=csv.QUOTE_ALL)

#open loop
for legi in legis:

    #write csv rows
    writer.writerows([
        (legi['first_name'], legi['last_name'], legi['chamber'], legi['party'] )
    ])

Then assuming you are in the same directory, type python legi-to-csv.py into the terminal and you should see a file called legi.csv added to your directory. Hopefully it has a series of semicolon-delimited fields of legislator goodness.
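
Because the writer above uses a semicolon as its delimiter, anything reading the file back needs to be told the same thing. Here is a minimal round-trip sketch, written for Python 3 (the snippets in this post are Python 2) and using an in-memory buffer instead of a file; the two rows are made-up placeholders, not real legislators.

```python
import csv
import io

# made-up rows standing in for legislator records
rows = [
    ("Jane", "Doe", "upper", "Party A"),
    ("John", "Roe", "lower", "Party B"),
]

# write with the same settings as the snippet above
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL)
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# read back with the same delimiter, or each line arrives as one big field
read_back = list(csv.reader(io.StringIO(text), delimiter=";"))
print(read_back)
```

If you would rather have true comma-separated output, dropping the delimiter argument on both sides should do it, since the comma is the csv module's default.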

UPDATE

So here is one more for you…

Firing this snippet in the terminal will ask you for a legislator’s last name, state and campaign cycle, query influence explorer and then write the results to a csv.

#!/usr/bin/env python

#import library
import sunlight
from sunlight import influenceexplorer
import csv

#ask for input
legi_name = raw_input("Input Legislator's last name...")
legi_state = raw_input("Enter state...")
legi_cycle = raw_input("Enter campaign cycle...")

#pull legis with last name entered by user
legis =  sunlight.openstates.legislators(
        state=legi_state,
        last_name=legi_name
)

#queries against influence explorer
contrib = influenceexplorer.contributions(
        contributor_state=legi_state,
        recipient_ft=legi_name,
        cycle=legi_cycle
)

#open csv writer
writer = csv.writer(open('money.csv', 'wb', buffering=0), delimiter=';', quoting=csv.QUOTE_ALL)

print "Writing CSV file for you"

#write one csv row for each contribution, for each legislator found
for legi in legis:
        for contributor in contrib:
                writer.writerow((
                        contributor['contributor_name'],
                        contributor['contributor_city'],
                        contributor['amount'],
                        contributor['seat'],
                        legi['first_name'],
                        legi['last_name'],
                        legi['district'],
                        legi['party']
                ))

print "Finished"

One thing I need to learn is how to write header rows, so I’ll work on that and update.
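
One possible approach to header rows, sketched here for Python 3 rather than the Python 2 used in the snippets above, is the csv module's DictWriter, which can write the column names before the data. The record below is a made-up placeholder shaped like the dictionaries used earlier, not real API output.

```python
import csv
import io

# the columns we want, in the order they should appear
fields = ["first_name", "last_name", "district", "party"]

# a made-up record; leg_id is an extra key we do not want in the file
records = [
    {"first_name": "Jane", "last_name": "Doe", "district": "1",
     "party": "Party A", "leg_id": "X1"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fields,
    delimiter=";",
    quoting=csv.QUOTE_ALL,
    extrasaction="ignore",  # silently skip keys not listed in fields
)
writer.writeheader()        # first line: the column names
writer.writerows(records)   # then one line per record
print(buffer.getvalue())
```

With a plain csv.writer, the same effect should be possible by calling writerow with the list of column names before the loop starts.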

From here, your imagination is really the limit. With Open States you can query legislation, committee assignments, committee hearings, all kinds of things. And don’t forget, this library interacts with the Congress and Capitol Words APIs, which means there is so much data at your fingertips. So a web producer could take this and build a search from it, or a journalist could use it to search for information right from their computer.

Let this settle in and next we’ll talk about feeding data from Open States to Google Fusion Tables to make a legi mashup map.

Feb 14, 2012
#open states API #python #Learning Library
Link via Nieman Journalism Lab: The newsonomics of signature content

Publishers, distributors, aggregators, and networks all want more money, and they’ve seen — courtesy of tablets and All-Access — that consumers are now more ready to pay for digital content than ever before.

via niemanlab.org

Reading this post from Ken Doctor, which focused a lot on digital television content providers like Hulu, Netflix and the like, makes me wonder what print news organizations can offer in 2012 as “signature content.”

Feb 5, 2012
From John Keefe - Making GOP primary maps the Fusion Table way

Election geeks, you are in luck. For the second time, Google plans to offer free, real-time election results, allowing anyone to tinker and play with hard-to-get voting numbers.

It’s for the Nevada Republican caucuses this Saturday, February 4, and even if you have no connection to Nevada, it’s a chance to experiment with live results like the Big Guys. Make a map. Mash up some data. Have fun.

via johnkeefe.net

John’s walkthroughs helped get me started, so adding this one to the list of how-to’s is not a difficult decision… Enjoy. And thanks as always John.

Feb 4, 2012
#fusion tables #Learning Library
freeDive from the Knight Digital Media Center helps you make Google spreadsheets searchable

The Knight Digital Media Center is offering a tool called freeDive, which is a wizard to “create user-searchable databases in minutes” based on a Google spreadsheet.

Best of all, it’s open source and free.

The wizard uses the Google Visualization API, Google Query Language, JavaScript and jQuery, and outputs an embed code. It was built by Len De Groot and Scot Hacker.

The wizard appears to work best with numerical data, but it’s really, really promising for a first-stage project. And again, it is open source.

See an example, create your own or check out the FAQ page.

Feb 1, 2012 (1 note)
#data #knight digital media center