Chris L Keller ...

Month

December 2011

8 posts

I got 99 reps to scrape so I learned me some Python

Updated: I’ve added some excellent information and knowledge from Ryan Pitts, who took the time to walk me through a couple things on a Saturday morning. Much thanks.

In getting ready for the 2012 elections, and politics in general, it makes sense to grab data and information about Wisconsin’s state senators and state representatives.

With last summer’s senate recall elections, I had already gone through and grabbed names, photo URLs, website links and contact information for the state’s 33 senators.

On each senator’s state webpage, state capitol and district contact information and biographies are also listed.

So to add those I bit the bullet (there are only 33 after all, and speed was a factor) and grabbed that information using ImportHtml in a Google spreadsheet, combined with some old-fashioned find and replace and copy and paste.

Here’s an example

=ImportHtml("http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Senate&district=1", "table", 7)

But the state assembly. Now there’s a daunting task. The state has 99 representatives, and the thought of the ImportHtml method and Ctrl-C, Ctrl-V … repeat … was not something I looked forward to.

So through Kevin Schaul’s Web scraping with Django tutorial I had played around a bit with Requests (a Python library? module?). And through some of the really basic tutorials on ScraperWiki I had figured out a bit about lxml.

And today seemed like a good time to put it all together? And then some.

The basic scraper came together fairly easily… But scraping a URL, copying the content from the terminal into a spreadsheet, changing the URL … repeat … wasn’t much of an answer. So why not take the time to LEARN SOME PYTHON instead of just schlepping my way through a task?

Each state representative has a webpage that can easily be determined by the district number.

This URL:

http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx?house=Assembly&district=1

belongs to Garey Bies of the 1st Assembly District.

And to grab a bio for Garey Bies, the URL has “&display=bio” appended to the end.

The JavaScript I have learned over the past year gave me an idea of how I could combine a URL and a variable.

And I remembered this little Python tutorial from last spring, so I knew a bit about looping through results. So all I had to do was figure out how to make all of this happen… And then write the output to a CSV?
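A minimal sketch of that idea, combining the URL pattern with a district number and looping over all 99 districts. The base URL and the “&display=bio” suffix come from the examples above; the function name and the way the loop is written are assumptions for illustration, not the original scraper.

```python
# Sketch only: the URL pattern is taken from the post; build_url is a
# hypothetical helper name, not part of the original code.
BASE_URL = "http://legis.wisconsin.gov/w3asp/contact/legislatorpages.aspx"


def build_url(district, house="Assembly", bio=False):
    """Return the page URL for one district, optionally the bio view."""
    url = "{0}?house={1}&district={2}".format(BASE_URL, house, district)
    if bio:
        url += "&display=bio"
    return url


# Assembly districts are numbered 1 through 99; range() stops before 100.
bio_urls = [build_url(district, bio=True) for district in range(1, 100)]

print(bio_urls[0])
print(len(bio_urls))
```

Each URL in the list could then be handed to whatever fetches and parses the page.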

Well I nearly pulled all that off, save for a couple things that didn’t quite work the way I had hoped — once I ran into encoding issues, I knew I was veering off the path — but to say this was a “foundational” learning experience is an understatement, and I accomplished the task I had at hand.

The code is below, and at this late hour it all looks like mush… but the comments — I comment everything and likely will until shamed into doing otherwise, but it’s something that the reporter and editor in me believes in — walk through what is happening.

So I just let the bios of 99 reps output to the terminal, copied that to the text editor, did some find and replace, pasted into the spreadsheet and I was done… And am left feeling I learned a lot in the process. And if allowed to get a bit sappy… It was a really encouraging way to close out what has been a tremendous year for personal learning. 

I did run into a handful of obstacles, and am not equipped to figure out a solution…

  • It would have been slick to figure out how to add some regular expressions to format the output.
  • My attempt to write to a CSV was successful, though every space in the output was replaced with a comma. There’s sure to be some formatting that could be used there.
  • Even the attempt to write to a text file wasn’t truly successful, as I ran into the following error after the first pass: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 269: ordinal not in range(128). So I threw that plan away, but the code remains and is commented out.
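For the first two items, here is a rough sketch of one way to tidy scraped text with a regular expression and hand it to Python’s csv module. It is written for Python 3, the sample row is a made-up placeholder, and since the original code isn’t shown here it doesn’t try to diagnose where the stray commas came from.

```python
import csv
import io
import re


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical placeholder row standing in for scraped output.
rows = [
    (1, "Example Representative", "  First line of bio,\n   second line.  "),
]

# An in-memory buffer keeps the sketch self-contained; to write a real file,
# use open("bios.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["district", "name", "bio"])
for district, name, bio in rows:
    # Passing a list of fields puts each field in its own column; the csv
    # module quotes any field that itself contains a comma.
    writer.writerow([district, name, clean(bio)])

csv_text = buffer.getvalue()
print(csv_text)
```

The point of the design is that writerow() gets one list per row, with one item per column, so spaces and commas inside a bio stay inside that bio’s cell.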

Thanks to Ryan Pitts, I’ve learned that adding .encode('utf-8') to data = el.text_content().strip() gets me past the error when trying to write the output to a text file. But as he points out, there will be other things to deal with, so I’ll keep working on that…

@ChrisLKeller Adding that .encode() will make your write work fine, but you’ll have stuff like ‘Sheriff’s Dept.’ to handle later.

— Ryan Pitts (@ryanpitts)

December 31, 2011
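A small sketch of what is going on with that character, written for Python 3 rather than the Python 2 the scraper ran on, so treat it as an illustration and not a drop-in fix. The sample text borrows the “Sheriff’s Dept.” example from the tweet, where the apostrophe is the curly \u2019 named in the error.

```python
# Python 3 sketch: \u2019 is the curly apostrophe from the error message.
text = "Sheriff\u2019s Dept."

# ASCII has no slot for the curly apostrophe, so encoding to it fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("ascii failed:", error.reason)

# UTF-8 can represent it, but as three bytes rather than one character,
# which is the kind of leftover Ryan is pointing to.
encoded = text.encode("utf-8")
print(encoded)

# Decoding with the same codec gets the original text back.
decoded = encoded.decode("utf-8")
print(decoded == text)
```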

And through Ryan I also learned about Beautiful Soup — will need more time to look through that — and some great resources on handling text in Python:

  • Unicode How-To
  • “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky. 

Here’s the updated code… As always, pointers, tips or links to learning resources are most welcome.

Dec 31, 2011
#data #python
I don't doubt this thought for one second...

Journalism still hasn’t found it’s Steve Jobs. He likely got laid off as a middle manager for being different. #wjchat

— Yuri Victor (@yurivictor)

December 29, 2011

Especially given the number of middle managers and journalists given their walking papers since 2008.

Dec 29, 2011
#journalism
Looking back at 2011 and wondering how I got to this place, cause that journey was pretty fun

For the first time in my nine years with Lee Enterprises my annual review falls inside the tidy boundaries of the calendar year. This means I finally stand a decent chance of escaping the Pavlovian responses to August and September, and find my sense of a “new beginning” coming when the calendar flips to January.

So, as I look back at my first year with madison.com (ok, 11 months) it’s difficult to separate some of the personal goals I had for 2011 from my professional goals. And maybe because on the whole, I’m left feeling that while I “scored” very well in achieving what I wanted to personally, my “scores” lacked quite a bit – by my standards – in the professional arena.

Still, there are inklings and hints as the calendar flips to 2012, that the two areas may once again start blending together.

So, what did I set out to accomplish for myself in 2011? How did I score?

  • Write a JavaScript function from scratch – ok maybe using Google – to create something other than an alert box.
  • Be able to play guitar and sing “Long Road,” “Heart of Gold,” or “Gloria” at the same time in a voice that I can be proud of.
  • Earn one freelance gig redesigning someone’s WordPress site.
  • Write something once a week. Anything.
  • Do something with all my old writing.
  • Quit smoking…while I am at the office. Outside the office will come in 2012.
  • Record one spoken word piece.

Of these, it feels really good to know that I can basically cross everything off my list except for playing guitar and singing at the same time – my neighbors are probably very happy that this didn’t happen – and recording a spoken-word piece.

But I quit smoking straight up, save for a few here and there at certain times and in certain places. And though I didn’t do much with my old writing, I stared at it enough to know what is there and what might be possible, and I combined some of that with my want to learn more JavaScript to create a short story presentation.

And I don’t know if I wrote something each week, but I think I came close thanks to some opportunities – unforeseen when setting goals in January – that came my way and opened so many other doors and allowed me to accomplish other goals.

Seriously need a google map layer of the 111 congress. Anyone have any bright ideas?

— Jon Davenport (@JonDavenport1)

March 28, 2011

For instance, I don’t think it’s a stretch to think that a single tweet from a Lee Enterprises colleague sent me off into a world of Google Fusion Tables and the Google Maps API, which provided me a reason to learn JavaScript and allowed me to learn from John Keefe and others, and turned me on to the power of tweeting shop with a host of really smart journo-techs, and left me wanting to foster a similar discussion within Lee Enterprises, so I got a company blog to share what I had been learning, and then found out about a little meetup in Chicago where folks would be brainstorming ideas for a project called the Knight-Mozilla MoJo Innovation Challenge, and so I submitted an idea – one of more than 300 that were submitted – and it was selected to be part of a learning lab with 59 other people, and so I submitted a final project for the lab, and that got selected for the next phase which took me to Europe for the first time where I met so many wonderful people that have shaped my thinking about journalism and the web for the better part of six months now, and so I have sought out ways to share that thinking with Communication students at the university I attended (and as a byproduct learned that I’m 12 credits short of an associate’s degree, or three total semesters short of getting that elusive four-year degree) or through fostering a discussion with Lee Enterprises colleagues.

And if reading that paragraph felt a bit disorienting, living it was a bit more so. Trust me.

So what does 2012 hold? What do I want to accomplish? I aim to write that up later this week, after I finish my performance review. But I’m left feeling the list might include things that you can’t link to on the internet. Character traits seldom have web pages.

Dec 29, 2011
#2011 #goals
Want easy Google Spreadsheet script to geocode addresses and export as GeoJSON? Thanks to @developmentseed here it is

…We’ve developed an add-on script for Google Docs Spreadsheets that lets you geocode arbitrary addresses and export spreadsheets as GeoJSON, a file format that works in TileMill.

via developmentseed.org

This Development Seed script is hosted on GitHub, and seems pretty accurate. It works with Yahoo! (API key required) and Mapquest geocoders.
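For a sense of what the export produces, here is a rough Python sketch that turns already-geocoded rows into a GeoJSON FeatureCollection. It is not the Development Seed script; the column names and the coordinate values are placeholder assumptions.

```python
import json


def rows_to_geojson(rows):
    """Build a GeoJSON FeatureCollection from rows with lat/lon columns."""
    features = []
    for row in rows:
        features.append({
            "type": "Feature",
            # GeoJSON lists coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [row["lon"], row["lat"]],
            },
            "properties": {"address": row["address"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical spreadsheet row; the values are placeholders, not real data.
rows = [{"address": "Example address", "lat": 43.0, "lon": -89.0}]

collection = rows_to_geojson(rows)
print(json.dumps(collection, indent=2))
```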

Dec 27, 2011
#data #google #map
Link: Register Citizen Newsroom Cafe celebrates one-year anniversary

There will never be a good time to commit time to audience engagement, becoming more transparent, trying new things and training staff, especially in a newsroom as small as ours. You have to “just do it.”

via connecticutnewsroom.wordpress.com

The above, part of an excellent post from the Journal Register Company’s Matt DeRienzo, could extend to so many aspects of a news organization. But in the context of a newsroom, where the daily routine is rote and sacred, it’s exponentially true.

Dec 19, 2011
More fun with Highcharts and Fusion Tables

Further stripping down the Fusion-Tables-To-Highcharts project, this uses Google Fusion Tables as a data backend to draw a chart that uses the Highcharts.js library.

Unlike the previous project — walkthrough is here — this doesn’t rely on a map click event to draw the chart, allowing this Fusion Table…

… to become this column chart …

… or a bar chart …

… or a line chart …

I’ve been interested in finding an easy-to-use data source for these charts since Wisconsin State Journal multimedia and graphics gurus Jason Klein and Laura Sparks started to use the Highcharts JavaScript library to produce data visuals that could be used in print, on the web and on tablets.

The “problem” to solve is two-fold:

  • It’s nearly a must to be able to keep track of the historical business data that we’re adding to Highcharts visuals.
  • Because this data is often in the hands of the reporter, the method for updating the charts needs to be ultra simple, and involve touching as little code as possible.

From here I’m thinking of learning from Kevin Schaul’s recent project, the box-chart-maker, which has a great user-interface that will generate the code needed, and ideally figure out a way to write to and update data housed in a Fusion Table, at least by the time Wisconsin has what could be the first of six potential election nights in 2012.

View the demo.
Fork the repo.

Dec 12, 2011
#data visualization #fusion tables #Learning Library
One day, someone new will be running things...

Maybe “I downloaded but didn’t share” will be the new “I smoked, but didn’t inhale.”

via waxy.org

Found the above quote via Dan Sinker’s blog: Andy Baio, writing about a remix of Pulp Fiction posted to YouTube, and observing that “everyone over age 12 when YouTube launched in 2005 is now able to vote.”

Dec 12, 2011
#remix-culture #youtube
Sometimes, three letters say it all...

WoW

— Chris Paul (@CP3)

December 9, 2011

via twitter.com

Dec 9, 2011
#nba