Chris L Keller ...

Text

Python scraper to check status of Madison beaches

Building on a Ruby scraper that Steve Faulkner whipped together on Saturday during a City Camp Madison session, here’s my hacked-together attempt at a Python scraper using Beautiful Soup and Requests.

It pulls data from the Madison and Dane County Public Health website to show which beaches are open and which are closed.

The terminal output looks like:

(data)(learning-scrape)… python beaches-scrape.py
BB Clarke - (open)
Bernies - (open)
Brittingham - (open)
Esther - (open)
Goodland County Park - (closed)
Governor Nelson State Park - ()
Hudson - (open)
James Madison - (open)
Lake Kegonsa State Park - ()
Lake Mendota County Park - (open)
Marshall - (open)
Memorial Union (Pier) - (open)
Olbrich - (open)
Olin - (open)
Spring Harbor - (open)
Stewart County Park - (open)
Stoughton Mandt Park Pond - ()
Tenney - (open)
Verona Fireman's Park - ()
Vilas - (open)
Warner - (open)

There’s a couple things that come to mind to add to this: The name of the lake that the beach is one, and some kind of if/else statement that can tell if the beach is monitored by another agency, and thus no status is available.

And while Steve’s exports the scraped data to JSON, mine will export it to a text file… not very useful I know, but I’m working on it.

You can also see the scraper at ScraperWiki.

Text

Using Python to send Open States API data to Google Fusion Tables and make a legi mashup map.

UPDATE: User found a couple senators didn’t have photos, which removes ‘photo-url’ from the dict. “I added a check for the key before writing and that fixed it. Reminder to anyone who tries this to check what info your state has available before starting in.”

Nearly three weeks ago I talked about feeding data from Sunlight’s Open States API to Google’s Fusion Tables to make a legi mashup map if only because it’s possible.

I’m happy to say not only is it possible, but it came together rather quickly, especially thanks to a previous walkthrough from John Keefe and a Fusion Tables Python library written by FT team member Kathryn Hurley.

Now, no doubt are there are more efficient ways to make this map happen, and even avoid Python all together. But at this stage of my abilities – pretty solid on basic Google Maps and Fusion Tables javascript API, and beginner Python – I’m a bit of a “I wonder if I can pull this off guy,” and I think this works well. It can definitely be within the wheelhouse of “intermediate to advanced beginners” and help you level up.

What We’ll Do

We will query Sunlight Labs Open States API for Wisconsin State Senator information using a Python Library and write it to a csv.

We will then upload & sync the csv to Fusion Tables using a Python script. We will merge that Fusion Table with Wisconsin State Senate district shapes and create a map that allows a user to enter their address and find their state senator.

You can see the end result here. The GitHub repo — with the code to make this happen — is here.

Starting out

To start, we’ll need a couple things…

Step 1

Let’s start with the setting the API key and installing the sunlight Python library.

Fire up your terminal and enter the following:

echo "YOUR API KEY BETWEEN THE QUOTES" > ~/.sunlight.key

Then using the python package installer pip you can install the Sunlight library.

pip install sunlight

However, pip doesn’t come with a default Python installation, so if you receive an error when you type in the command above, you can install setup tools first, which will give you access to the easy_install package manager.

Then it’s a matter of using easy_install to grab pip, and pip to grab sunlight.

easy_install pip
pip install sunlight

Step 2 (Create a Fusion Table)

Now we’re going to want to create our Fusion Table, because we’ll need to know the column headings and we’ll need the table ID.

I’ve found the easiest way to do this is to create a Google Spreadsheet and then import that to Fusion Tables. As for the column headings, I grabbed a list of the available Open States data to use the basis, though I ran into some issues with some of the fields that I’ll explain later on. I also added a GEOID column which I will use as the key column when it comes time to merge this with the shapefile. You can see my Fusion Table here.

Once you have your Fusion Table, you will be able to find the table id by going to File –> About in the Fusion Tables interface.

Step 3 (Python)

If you grabbed the GitHub repo I created, you’ll find that the sunlight-ft-python directory contains several Python scripts. We’re concerned right now with:

  • data_search.py: This searches Sunlight’s Open States api and write the data to legi.csv
  • data_import.py: This uses authentication to sync the data in legi.csv file to the Fusion Table you created.

In the first part of data_search.py, we’re importing the sunlight library and Python’s csv writer. We’re then setting our search variables – wisconsin and upper chamber in this case – but we could have easily just adding them in the next section. I pulled them out just to make them more obvious.

#import libraries
import sunlight
import csv

#variables for the search
state_name = "wi"
chamber_name = "upper"

#pull API data
legis =  sunlight.openstates.legislators(
    state=state_name,
    chamber=chamber_name,
)

In the second part of data_search.py, we’re opening an csv file to write to and then looping through the data that is being returned from the API.

I’m checking to see if the district number being returned is a single-digit district or a double-digit district. If it’s single I’m adding an extra zero to the GEOID designator. If it’s double, it only gets one zero. This is the column I will be using to merge my Fusion Table with the shapefile. There are many ways to pull this off, and you may have a better solution.

The information contained within writer.writerows match column headers of my Fusion Table. And after each step of the loop, I’m writing the information to the csv and closing the file.

#open csv writer
writer = csv.writer(open('legi.csv', 'wb', buffering=0), delimiter=';', quoting=csv.QUOTE_ALL)

#open loop
for legi in legis:

    geo_pre = legi['district']

    if len(geo_pre) == 1:
        item = "5500"

        #write csv rows
        writer.writerows([
            (item + legi['district'],
            legi['full_name'],
            legi['last_name'],
            legi['first_name'],
            legi['middle_name'],
            legi['photo_url'],
            legi['state'],
            legi['chamber'],
            legi['party'],
            legi['district'])
        ])

    elif len(geo_pre) == 2:
        item = "550"

        #write csv rows
        writer.writerows([
            (item + legi['district'],
            legi['full_name'],
            legi['last_name'],
            legi['first_name'],
            legi['middle_name'],
            legi['photo_url'],
            legi['state'],
            legi['chamber'],
            legi['party'],
            legi['district'])
        ])

To run this, fire up your terminal, cd into your directory and run

python data_search.py

You should see a file titled legi.csv in the directory and hopefully is has data in it? If so, we’re now ready to work with data_import.py.

There is really only one thing we need to do here, and that is adjust the column headers. Towards the bottom – line 37 or so – you’ll find the following:

cols = ["GEOID",
  "full_name",
  "last_name",
  "first_name",
  "middle_name",
  "photo_url",
  "state",
  "chamber",
  "party",
  "district"]

Here is where we define the columns we want to push to Fusion Tables and sync.

Once that’s adjusted, head back to the command line and you will use the following, adding in your Google docs account, the name of the csv file – legi.csv – and the ID of your Fusion Table.

python data_import.py [google account username] [csv file] [fusion table id]

If all goes well, you will be asked for your Google account password and you will see each column being written. The script is set to update the table every two minutes, and since we don’t really need that, you can press CTL + C once everything is successful. You’ll know because your Fusion Table will be populated with brand new data.

Step 4 (Merging with your shapefile and making your map)

I won’t spend too much time here on this step because there are several walkthroughs available; I have a list of some here. But by all means, should you have questions or suggestions, don’t hesitate to let me know.

Anyway, from here we’ll just merge our two Fusion Tables together – the table with the data and the table that contains the shapefile you uploaded. Save the resulting map, and use either the iframe embed method or Fusion Tables Layer Builder to add your map to a HTML page.

Additional Resources

Google’s Fusion Tables

Google’s Fusion Tables Presentation

Fusion Tables Help

Google’s Fusion Tables Group

Google Maps API

Google’s Maps API Group

TIGER/Line Shapefiles

Shape to Fusion, aka shapescape

Text

Fresh from #nicar12, here are curated notes to set up a Windows 7 Python environment so I can practice at work

The following are bullet points gathered from walkthroughs created by Anthony DeBarros and the Kenneth Reitz’s Python Guide to get a Python development environment up and running on my work machine.

I needed admin rights for only two steps:

  • Python 2.7.2 installation (Don’t install 3.1)
  • Setting the Python path

Once those steps are completed, these steps will get easy_install pip, virtualenv, django and many other packages up and running.

Download Distribute for Windows and save the script to your Python27 directory.

Open your command prompt by clicking Start Menu and search for cmd to find cmd.exe. You will change into the Python27 directory and run the distribute setup.py script. In turn, you will be able to install pip which is really the gateway drug to adding packages and learning your way.

cd C:\
cd C:\Python27
python distribute_setup.py
easy_install pip
pip install virtualenv

My virtual environments exist inside of C:\Python27\Scripts, so to get there, change to that directory and create a new virtual environment.

cd Scripts
virtualenv --distribute <environment name>

To activate virtual environment:

cd <environment name>
cd Scripts
activate.bat

From here, you can use pip to install python packages to the virtual environment

pip install django
pip install csvkit
pip install BeautifulSoup
pip install mechanize

To deactivate a virtual environment

deactivate

My next goal is to get a version of virtualenvwrapper running, which makes managing virtual environments a piece of cake.

Text

How a beginner used Python to interact with the Open States API

Updated with some corrections/clarifications from Paul Tagliamonte, who hacked on the newly-released python-sunlight library and blogs here.

Updated with new code snippet to use terminal input to search for campaign contributions to legislators and write to a csv.

A little less than a year ago I tried to play with the Open States legislator API, and with a bunch of help came away with a simple php-based search of our legislators.

If you haven’t played with the Open States API before, you should sign up for an API key and start, but there is a wealth of information — 47 states worth — just waiting to be unlocked by your imagination.

If you at least a passing interest in python, here are some steps that a beginner can use to grab legislator data and write it to a csv. I’ll follow up later with a method of sending that data to Google’s Fusion Tables where you could integrate into a map.

So where to start?

On Monday, Sunlight Labs — part of the Sunlight Foundation with created Open States — announced an updated python library to interact with it’s Congress, Open States and Capitol Words APIs.

You will need an API key to use it, but it’s just a matter of registering and waiting for the email.

Once you have the API key, Sunlight Labs has a couple easy steps to get setup. This is straight from their blog post:

If you’re on a UNIX-type (MacOS, GNU/Linux, *BSD, AIX or Solaris (or any of the other POSIX-ey systems)) machine, you should be able to run a command that looks like the following:

echo "your-api-key-here" > ~/.sunlight.key

It’s worth mentioning that your-api-key-here should actually be your API key that was emailed to you up above.

Via Paul, here’s what is happening at this step:

…this actually creates a “dot” (really: hidden on UNIX systems, so it doesn’t cloud up your folder) file in your home directory

The tilde (~) expands to where your home folder is — so on my system, it’d be something like

/home/tag/

So this actually writes a file to

/home/tag/.sunlight.key

so that each user on the system can have their own key.

The lib (during runtime) will look to see if this one file is present, and use that if it is. You can see if it’s there (as well as tons of other dot-files) by running something like:

ls -la ~

Basically — and correct me if I am wrong here — this you want to make a “dot” file .sunlight.key and place it in your root directory and tell python where to find it.

Then using the python package installer pip, bring the library to your system.

pip install sunlight

Then using the python package installer pip you can install the Sunlight library.

pip install sunlight

However, pip doesn’t come with a default Python installation, so if you receive an error when you type in the command above, you can install setup tools first, which will give you access to the easy_install package manager.

Then it’s a matter of using easy_install to grab pip, and pip to grab sunlight.

easy_install pip
pip install sunlight

Now whenever a walkthrough suggests placing a file that begins with a dot in a directory that has ~ stand in for it, I get a little nervous; I’m not gonna lie. This one was fairly painless.

The documentation and GitHub repo has some sample code — a lot of sample code — and I repurposed a portion for this little snippet. You can save as legi-return.py, and it will return all the legislators for Wisconsin.

#imports the library
import sunlight

#query for wisco legislators
legis =  sunlight.openstates.legislators(
    state='wi'
)

#identify each legislator in the output
for legi in legis:

    #return each legislator
    print "%s %s (%s) District: %s Party: %s ID: %s" % (
        legi['first_name'], legi['last_name'], legi['chamber'], legi['district'], legi['party'], legi['leg_id'])

Just cd into the directory and type python legi-return.py into the terminal and you should see some output.

Capturfiles

If you get an error, it might be of the indentation variety, so I’ve also made a gist of the code.

Once I got this up and running, and being a python beginner curious of my own abilities, I started to wonder if I could create a search of legislators from the terminal. So I made this code snippet to search for Wisconsin state legislators by last name. You can call this legi-lookup.py if you want to.

#import library
import sunlight

#ask for input
legi_name = raw_input("Enter the Legislator's Last Name...")

#pull wisco legis with last name entered by user
legis =  sunlight.openstates.legislators(
    state='wi',
    last_name=legi_name
)

#tell the user what you found
print "We found these legislators" 
for legi in legis:
    print "%s %s (%s) District: %s Party: %s ID: %s" % (
        legi['first_name'], legi['last_name'], legi['chamber'], legi['district'], legi['party'], legi['leg_id'])

As you can see, this borrows quite heavily from the first snippet. The difference is I’m asking for input from the terminal — use Fitzgerald — and then narrowing the results based on the input.

Capturfiles_1

Also, this area here…

legis =  sunlight.openstates.legislators(
    state='wi',
    last_name=legi_name
)

…can take all kinds of parameters to narrow your query. There’s a list here you can use.

So I’m feeling all good about thing, and now I want to take my initial query — for all of the state legislators — and I want to write them to a csv. I had this from a prior walkthrough…

import csv

#opens the csv writer, specifies the file name
writer = csv.writer(open('stocks.csv', 'wb', buffering=0))

#data that should be written, separated by comma
writer.writerows([
    ('GOOG', 'Google, Inc.', 505.24, 0.47, 0.09),
    ('YHOO', 'Yahoo! Inc.', 27.38, 0.33, 1.22),
    ('CNET', 'CNET Networks, Inc.', 8.62, -0.13, -1.49)
])

…so I figured the principals were the same.

It actually wasn’t that difficult to figure out since I was already identifying the specific fields that I wanted to be output. You can save this snippet as legi-to-csv.py.

#import libraries
import sunlight
import csv

#pull wisco legis
legis =  sunlight.openstates.legislators(
    state='wi'
)

#open csv writer
writer = csv.writer(open('legi.csv', 'wb', buffering=0), delimiter=';', quoting=csv.QUOTE_ALL)

#open loop
for legi in legis:

    #write csv rows
    writer.writerows([
        (legi['first_name'], legi['last_name'], legi['chamber'], legi['party'] )
    ])

Then assuming you are in the same directory, type python levi-to-csv.py into the terminal and you should see a file called levi.csv added to your directory. Hopefully it has a series of comma delimited fields of legislator goodness.

Capturfiles_2

UPDATE

So here is one more for you…

Firing this snippet in the terminal will ask you for a legislator’s last name, state and campaign cycle, query influence explorer and then write the results to a csv.

#!/usr/bin/env python

#import library
import sunlight
from sunlight import influenceexplorer
import csv

#ask for input
legi_name = raw_input("Input Legislator's last name...")
legi_state = raw_input("Enter state...")
legi_cycle = raw_input("Enter campaign cycle...")

#pull legis with last name entered by user
legis =  sunlight.openstates.legislators(
        state=legi_state,
        last_name=legi_name
)

#queries against influence explorer
contrib = influenceexplorer.contributions(
        contributor_state=legi_state,
        recipient_ft=legi_name,
        cycle=legi_cycle
)

#open csv writer
writer = csv.writer(open('money.csv', 'wb', buffering=0), delimiter=';', quoting=csv.QUOTE_ALL)

print "Writing CSV file for you"

for legi in legis:
        (
                legi['first_name'],
                legi['last_name'],
                legi['district'],
                legi['party'])

for contributor in contrib:
        (
                contributor['contributor_name'],
                contributor['contributor_city'],
                contributor['amount'],
                contributor['seat'],
        )

                #write csv rows
            writer.writerows([
                    (
                    contributor['contributor_name'],
                        contributor['contributor_city'],
                        contributor['amount'],
                        contributor['seat'],
                        legi['first_name'],
                        legi['last_name'],
                        legi['district'],
                        legi['party']
                        )
                ])

print "Finished"

One thing I need to learn is how to write header rows, so I’ll work on that and update.

From here, your imagination is really the limit. With the Open States you can query legislation, committee assignments, committee hearings – all kinds of things. And don’t forget, this library interacts with the Congressional and Capitol Words API, which means there is so much data at your fingertips. So a web producer could take this and build a search from it, or a journalist could use it to search for information right from their computer. 

Let this settle in and next we’ll talk about feeding data from Open States to Google Fusion Tables to make a legi mashup map.

Text

I got 99 reps to scrape so I learned me some Python

Updated: I’ve added some excellent information and knowledge from Ryan Pitts, who took the time to walk me through a couple things on a Saturday morning. Much thanks.

In getting ready for the 2012 elections, and politics in general, it makes sense to grab data and information about Wisconsin’s state senators and state representatives.

With last summers senate recall elections, I had already gone through and grabbed names, photo urls, website links and contact information for the state’s 33 senators.

On each senator’s state webpage, state capitol and district contact information and biographies are also listed.

So to add those I bit the bullet — there are only 33 after all, and speed was a factor — so I grabbed that information using an ImportHtml in a Google spreadsheet combined with some old-fashioned find and replace and copy and paste.

Here’s an example

ImportHtml(“http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Senate&district=1”, “table”, “7”)

But the state assembly. Now there’s a daunting task. The state has 99 representatives, and the thought of the ImportHtml method and CTL-C, CTL-V … repeat … was not something I looked forward to.

So through Kevin Schaul’s Web scraping with Django tutorial I had played around a bit with the Requests — a Python library? module? And through some of the really basic tutorials on ScraperWiki I had figured out bit about lxml.

And today seemed like a good time to put it all together? And then some.

The basic scraper came together fairly easy… But scraping a URL, copying the content from the terminal into a spreadsheet, changing the URL … repeat … wasn’t much of an answer. So why not take the time to LEARN SOME PYTHON instead of just schlepping my way through a task.

Each state representative has a webpage that can easily be determined by the district number.

This URL —

http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Assembly&district=1

— belongs to 1st Assembly District Garey Bies

And to grab a bio for Garey Bies, the url has “&display=bio” appended to the end.

The javascript I have learned over the past year gave me an idea on how I could combine a url and a variable together.

And I remembered this little Python tutorial from last spring, so I knew a bit about looping through results. So all I had to do was figure out how to make all of this happen… And the write the output to a CSV?

Well I nearly pulled all that off, save for a couple things that didn’t quite work the way I had hoped — once I ran into encoding issues, I knew I was veering off the path — but to say this was a “foundational” learning experience is an understatement, and I accomplished the task I had at hand.

The code is below, and at this late hour it all looks like mush… but the comments — I comment everything and likely will until shamed into doing otherwise, but it’s something that the reporter and editor in me believes in — walk through what is happening.

So I just let the bios of 99 reps output to the terminal, copied that to the text editor, did some find and replace, pasted into the spreadsheet and I was done… And am left feeling I learned a lot in the process. And if allowed to get a bit sappy… It was a really encouraging way to close out what has been a tremendous year for personal learning. 

I did run into a handful of obstacles, and am not equipped to figure out a solution…

  • It would have been slick to figure out to add some regular expressions to format the output.
  • My attempt to write to a CSV was successful, though every space in the output was replaced with a comma. There’s sure to be some formatting that could be used there.
  • Even the attempt to write to a text file wasn’t truly successful, as I ran into the following error after the first pass: UnicodeEncodeError: ‘ascii’ codec can’t encode character u’\u2019’ in position 269: ordinal not in range(128). So I threw that plan away, but the code remains and is commented out.

Thanks to Ryan Pitts, I’ve learned that adding .encode(‘utf-8’) to data = el.text_content().strip() gets me past the error when trying to write the output to a text file. But as he points out, there will be other things to deal with, so I’ll keep working on that…

And through Ryan I also learned about Beautiful Soup — will need more time to look through that — and some great resources on handling text in Python:

Here’s the updated code… As always, pointers, tips or links to learning resources are most welcome.

Tagged: data python