On Friday 25th March I attended the ScraperWiki Hacks and Hackers Day at BBC Scotland in Glasgow. As it turned out I was too busy taking part in the hacking to get my notes up on the day, so this is a very belated “live blog” largely covering the opening and closing sections of the day. It’s also well worth looking at the BBC videos of the day and the official ScraperWiki blog posts on the day.
Aine McGuire introduced the day. Hacks & Hackers Day Glasgow has been arranged by ScraperWiki, supported by BBC Scotland, the BBC College of Journalism and with prizes donated by the Guardian Open Platform.
Francis Irving was next up, giving an introduction to ScraperWiki (see also the video below).
Julian Todd, CTO of ScraperWiki, had started the site when he wanted to know how his MP had voted. Other people were interested too, and thus TheyWorkForYou was started. Along with voting records, some new data was added about divisions etc.
Richard Pope had a local pub in Brixton called The Queen. One day it disappeared. He was outside the zone in which he would have received a letter about planning decisions. So he screen-scraped all of the local planning websites and set up a site to alert you automatically. PlanningAlerts.com was born, but the site requires hundreds of scrapers to work properly, so it does not entirely work at the moment. Thus one of the reasons behind ScraperWiki is to maintain more ambitious scrapers.
A Quick Technical Diversion
ScraperWiki lets you write Python, PHP or Ruby scripts in your browser. Data goes into an SQLite data store. You can put data in and run queries to extract it. It has a scheduler that can run in the cloud so you don’t have to maintain and run these all the time. You can also make a request for a new dataset or a scraper from the broader ScraperWiki community.
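To make that scrape–store–query loop concrete, here is a minimal sketch using only the Python standard library; the ScraperWiki site wraps the fetching and saving steps in its own helper library, and the HTML snippet and table layout here are invented for illustration:

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a fetched page -- on ScraperWiki you would download the
# real page first rather than embed it like this.
PAGE = """
<table>
  <tr><td>Avonbridge</td><td>12</td></tr>
  <tr><td>Falkirk</td><td>340</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

parser = CellCollector()
parser.feed(PAGE)

# Store the scraped rows in SQLite, then query them back out.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (area TEXT, total INTEGER)")
db.executemany("INSERT INTO calls VALUES (?, ?)",
               [(area, int(total)) for area, total in parser.rows])
busiest = db.execute(
    "SELECT area FROM calls ORDER BY total DESC LIMIT 1").fetchone()[0]
print(busiest)  # Falkirk
```

The same three steps – fetch, parse, save to the SQLite store – are what every scraper on the site boils down to; the scheduler simply re-runs them for you.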
The whole idea of today is to use unstructured data sources rather than structured ones. We want to form project teams around data that interests you, and to ask what datasets you might be interested in. We form groups, we create a story, and we all give a three-minute presentation at 6pm; there will be a prize for the best project. Judges will be: Huw Owen, Editor of Good Morning Scotland; Allan Donald of STV; and Jon Jacob of the BBC College of Journalism.
What data do you care about?
At this point we went round the room sharing our data interests:
- Hacker at the BBC – I’m interested in sports and election data
- Brendan Crowd, BBC R&D – I have a particular idea, augmented reality, mobiles, and political manifestos. Scan an area and get summaries of manifesto data about, say, infrastructure.
- Ally Craigwell, Developer at BBC Scotland – Just here to get involved.
- Mo McRoberts, Data Analyst in BBC Dev – Geographical data is my main area of interest.
- Martin Inglis, Guardian – We are coming up to the 5th anniversary of the smoking ban so maybe data about that?
- Robert McWilliam, hacker at BlueFlow, a startup in Aberdeen – I’m here to get involved too.
- Julian Todd, ScraperWiki hacker – I’m interested in seeing what difficult Scottish Parliament stuff we can help with – we did lots of this at the Cardiff hack day.
- Chris Sleight, hack for BBC Newsroom for Central Scotland – The Police and Fire service are publishing data to the web without structure so I would like to put that into a reusable format.
- Paul Millar, hacker and Drupal Developer – I’m interested in tax-funded stuff generally.
- Ben Lyons, Institute for Research and Innovation in Social Services, Hacker – I’m happy to play with various data, I’m interested in data visualisation (done programmatically so that you have up to date visualisations).
Julian noted that he would be showing some visualisations created by ScraperWiki for Dispatches that might be of interest.
- David Eyre, Hack, Senior Broadcast News journalist with the Gaelic service – There has been a spate of fire deaths over the New Year. I’d like to scrape data on that, look for trends, and overlay it with data on deprivation etc. Also there is a relatively new website for public authority notifications – planning applications, roadworks etc – and it would be good to look at that and think about how to use that data and make it usable elsewhere.
- James Baster, Hacker – I’m just generally interested.
- Nicola Osborne, Social Media Officer at EDINA, hacker(ish) – I was interested in data that connects to academia as well as ways in which geo data and middleware could be used as part of hacks and mashups with other data.
- Finn, Hack – I am interested in arts and culture and education data.
- Michael McLeod, Hack for Edinburgh Local blog for the Guardian – I am interested in sharing data. I will be covering the election in Edinburgh so it would be great to see data on the voting records of MSPs here. Also what does Edinburgh [the City/City Council] own and how does that relate to cuts that are happening/coming up?
- Anand Ramkissoon, Viewsouth, occasional hacker – I am working on a project on general practices in the most deprived areas of Scotland, and the calling in of support services by these practices – so data on preventative health/social prescribing organisations.
- Bob Kerr, Open Street Map, Hacker – My interest is to overlay data with maps. However as a side project I want RCAHMS to release their data (but that’s more of a copyright than a technical issue).
- Peter MacKay, Hack, Journalist and blogger with the Gaelic service – I’m interested in public expenditure and public building ownership, planning applications, quangos etc. I would like to see new companies created/funded by the public sector in some way mapped onto information about the percentages of public/private funds/ownership etc. I’d also like to look at the oil industry and its effect in the UK.
Julian noted that ScraperWiki have a data set of all the deep oil wells in the UK – every well has a webpage that can be accessed.
- Sean Carroll, Developer at BBC Scotland – I’m just generally interested.
- Bruce Munro, learning department at BBC Scotland – I’m particularly interested in data and in making it manageable for young people and teachers to use in education – stuff like political information.
- Paul McNally, Hacker, background as an artist hacking websites to display information – I want to consider how you search for information? What information do people want? How do you ask the question in order to find information that’s already there?
- Jon, who is here to film us today along with Angelique from the BBC College of Journalism – We are here to film today and to see what the joy is in all of this. It looks like maths and numbers, and that sounds quite dry – what’s the motivation or the thrill?! We’ll be coming around today to talk to you and find out.
- Angelique, CoJo (BBC College of Journalism) – I’m interested in lots of the stuff here and the stories that are surfaced through data journalism – looking behind the stats, questioning data, FOI etc.
Julian closed off the ideas section, noting that we’d had several mentions of visualisation. At the end of the day you could build an application – it could be data, it could be numbers. If you have good visualisation skills it would be great for you to be flexible, as many of the teams may need help with this.
The outputs for today could be any of three quite different types of data journalism. The output might be a headline and a story that comes out of data; it might be a tool that helps the user ask questions – perhaps a visualisation to explore data; or it might be about building an ongoing conversation – so ongoing applications etc. Journalism doesn’t do much of the third category, but it will need to do more of that in the future to compete with other organisations.
So, to finish off the introductory part of the day, you might be wondering how ScraperWiki makes money. Well, private scrapers (the default is public and shared) and excessive API calls can be charged for, as can consulting and pieces of work for others – big organisations don’t need to do this stuff regularly, just occasionally, so it makes sense to contract it out. That is how ScraperWiki came to work with Channel 4 News and the Dispatches programme.
Cue two visualisations which were created for Dispatches. These fit into the “helping the user ask questions” category. The visualisation consists of a blob that, when clicked, splits expenditure into different categories so that the user can then see a summary of the data for that category. [At this point Julian showed the military spending example and explained that Single Use Military Equipment isn’t actually for a single use; it’s for military use only and not at all for civilian use – a little factoid there.]
Julian also showed a map visualisation showing publicly owned brownfield land that could be used for housing (sliced geographically). The data only covered England though.
Julian added the wonderful footnote at this point that “constructing sentences is a massively underused type of visualisation”!
And thus began a very long and very busy day of hacking. At 6pm we stopped typing and presented what we’d done…
Aine opened up the presentation part of the day by saying that “Coders can change the world! Put them in a room with any other discipline and it’s magical!”. With that cheery opener we all presented our work to the group and the judges (you can find full info on who was part of each team in the ScraperWiki blog post on the event):
Fire Bugs
We were looking at fire data from the Central Scotland Fire and Rescue website – all their calls are listed there, but not in a good format. 60 calls were visible, but we found that there were some 15,000 records buried behind that front page!
We focused on malicious false alarms – what is the percentage, and in which areas? The address of the caller is available in a fairly uniform format, as is text describing the fires. The scraper is firebug simple viz 1. It turns out that 3.5% of calls are malicious. We used Protovis to visualise the data and found that in Avonbridge (a small village) 30% of all calls are malicious. This sort of understanding of the data is really interesting and offers huge potential for reporting. You could set up a Twitter feed with location data embedded. You could also compare to Scottish and UK data and look for clusters and trends.
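The “percentage by area” figures are exactly the kind of query the SQLite data store makes easy. A hedged sketch of that aggregation, using a made-up handful of call records rather than the real 15,000:

```python
import sqlite3

# Hypothetical extract of the scraped fire-call data: (area, call_type).
# These records are invented for illustration.
calls = [
    ("Avonbridge", "Malicious False Alarm"),
    ("Avonbridge", "Dwelling Fire"),
    ("Falkirk", "Malicious False Alarm"),
    ("Falkirk", "Dwelling Fire"),
    ("Falkirk", "Dwelling Fire"),
    ("Falkirk", "Dwelling Fire"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calls (area TEXT, call_type TEXT)")
db.executemany("INSERT INTO calls VALUES (?, ?)", calls)

# Percentage of malicious false alarms per area, highest first.
# SQLite treats the boolean comparison as 0/1, so SUM() counts matches.
rows = db.execute("""
    SELECT area,
           100.0 * SUM(call_type = 'Malicious False Alarm') / COUNT(*) AS pct
    FROM calls
    GROUP BY area
    ORDER BY pct DESC
""").fetchall()

for area, pct in rows:
    print(f"{area}: {pct:.1f}%")
```

With the real 15,000 records in the store, the same GROUP BY query surfaces outliers like Avonbridge immediately.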
Public Buildings for Sale
This was a less visual project. How much public land or property is sold without the knowledge of the public to whom it belongs? It turns out to be a hideously complex structure of data – a search engine of doom on scottish-property.gov.uk. We wanted to scrape this and create a map so that you could look at what was for sale or rent near you and what had already been sold.
Technically speaking, it was our lack of experience with ScraperWiki and the really complex structure of nested HTML tables that scuppered us. We did pull some data out, but it’s not right yet (the scraper is miglis/Scottish Executive Property Scraper).
Edinburgh Planning Application Map
Michael McLeod and Robert McWilliam were the team here. Michael explained that part of his job is to find planning applications. To do this on the Edinburgh City Development portal requires a reference – an area or postcode is not good enough. It is really difficult to use. Lots of UK planning applications are managed similarly badly online. In the case of Edinburgh, a map on the portal doesn’t even work. But the plans are updated as PDFs weekly – or so we thought…
Robert spotted that although the PDFs are dated as weekly updates, they are actually being added on a daily basis. He used ScraperWiki to put the data onto a map – updated every day so that you can see what is planned, with a link back to each planning application. For example, a paintballing company recently set up a Facebook page promising “When this page gets to 10k likes we’ll release the address”; from a journalist’s point of view you want to see whether the site is next to people’s houses – and it turns out that it is. Michael added that he will be knocking on doors on Monday to see what fellow residents think of Urban Paintball behind their flats!
Anyway back to the map. The planning data that has been created forms an RSS feed that will automatically tweet out new applications so that you can monitor what is going on where in the city.
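Once the applications are in a structured form, the feed itself is straightforward to produce. A minimal sketch with Python’s standard library – the application references, titles and links below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical planning applications pulled out of the weekly PDFs.
applications = [
    {"ref": "11/00123/FUL", "title": "Change of use to paintball centre",
     "link": "http://example.org/plan/11-00123"},
    {"ref": "11/00124/LBC", "title": "Replacement windows",
     "link": "http://example.org/plan/11-00124"},
]

# Build a minimal RSS 2.0 document: one <channel> holding one
# <item> per planning application.
rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Edinburgh Planning Applications"
ET.SubElement(channel, "link").text = "http://example.org/planning"
ET.SubElement(channel, "description").text = "New applications, updated daily"

for app in applications:
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = f"{app['ref']}: {app['title']}"
    ET.SubElement(item, "link").text = app["link"]

feed = ET.tostring(rss, encoding="unicode")
print(feed)
```

A feed like this is what lets downstream tools (an auto-tweeting bot, a map layer) pick up new applications without re-scraping anything themselves.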
Hide by the Clyde [my team]
Our team was myself, Bruce Munro and Sean Carroll from BBC Scotland, and Bob Kerr of Open Street Map. The project started out as a look at areas of social deprivation, comparing these with data from Scottish schools on attainment, employment etc. We worked with LTS Scotland and Scottish Government data – both organisations provide a wealth of information online, but as searchable sites and documents for download rather than as APIs.
We started with Scottish Schools Online – just a simple search site. Sean wrote a custom PHP script to scrape data from it – the data was well structured on the site, so by searching for all schools and then munging the returned HTML you could discern the type of data from, amongst other things, the div class being used.
Sean used the addresses of the schools with the Google Maps API to build up a map of Scottish schools but that was all that was coded in the day.
However, when we were investigating possible data sources, Bruce and I had looked through the Scottish Government data in particular to see what was available and whether there were statistics that would help illustrate the need to make this data easier to engage with. We showed some visualisations (created with Many Eyes) to illustrate the importance of, and some broad trends in, this data: a map showing the level of free school meal uptake in Scotland (warning: this link may well crash your browser!) and a bubble chart of two local school authorities, Glasgow and Argyll and Bute (see below):
Despite having relatively similar poverty levels (as a percentage of the population) other outcomes vary highly between these authorities indicating cultural and social differences between urban and rural areas.
There is a huge amount that could be done with this data, but most Scottish Government data is currently provided as spreadsheets for download, making a real need for an API or a suite of scrapers.
Crash Test Dummies
This BBC team looked at data on road traffic accidents in Scotland. They looked particularly at North East Scotland and, using that data, built a quick website “Mind How You Go” based upon 2005-9 data. You are asked questions and are presented with statistics about your chances of getting home in one piece.
The team also found a timeline of initiatives over the years – seatbelts, speed limits – along with data on populations, licensing and accidents. From this data the team built a spreadsheet to compare accidents. Casualties per car go down over time, so with historic data one could produce extensive visualisations with timelines to view the risks associated with driving under specific conditions.
Mo, who was on this team, wanted to do a meta study of how road traffic accidents were reported in the news, and where. So he wrote a scraper which performs a Google search of BBC Scotland for specific crash-related search terms. The scraper fetched the results, parsed the stories, then used GeoNames data to find the nearest city. He then tried to automatically find the road – A and B roads were easy, but other roads not so much. From this he created a Google Map of all of the accidents in 2010. This showed quite a lot of geographical bias in the reporting of accidents.
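The nearest-city step can be sketched as a simple great-circle lookup against a gazetteer. This is an illustration only – the four-entry gazetteer and the test coordinates below are made up, where the real scraper would look up the full GeoNames dataset:

```python
import math

# A tiny GeoNames-style gazetteer (invented subset): name, lat, lon.
CITIES = [
    ("Aberdeen", 57.1497, -2.0943),
    ("Edinburgh", 55.9533, -3.1883),
    ("Glasgow", 55.8642, -4.2518),
    ("Inverness", 57.4778, -4.2247),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def nearest_city(lat, lon):
    """Return the name of the gazetteer entry closest to (lat, lon)."""
    return min(CITIES, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]

# A crash story geocoded to somewhere on the M8, west of the city centre.
print(nearest_city(55.86, -4.10))  # Glasgow
```

The roads problem is harder precisely because there is no equivalent gazetteer mapping free-text road descriptions to coordinates, which is why A and B road numbers were the easy cases.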
Magners League
This was another BBC-related hack. The team decided to build a very simple, practical scraper for the Magners League website. The scraper grabs the league table, and this is useful because every week BBC Alba broadcast matches from the Magners League and do all sorts of visuals and on-screen graphics for them. There is a table on the website that updates with the results, but currently our data operators have to copy this information manually in real time so that it can be shown as graphics on screen. It would obviously be preferable to do this automatically. So, we did it and tried it and it worked. It needs a wee bit more development, but then we will be using it every week on TV! That’s it.
BME – Impact of the recession on education
Data journalism isn’t just about the data that is there but, sometimes, about there not being enough data to go around. The team wanted to look at data on Black and Minority Ethnic people in Scotland. Educationally, there are statistics suggesting that Africans are doing better than their white or Indian counterparts, but employment rates do not reflect this. They wanted to look at statistics on specific groups in specific areas – so that you can get BME statistics at national and city levels. However, data is not currently collected at all of the relevant levels. There is also a real lack of data, and hugely conflicting data, about educational attainment. In general there seems to be a dearth of BME educational data.
So the story here is about the need for a data campaign. There is a need to build on crowdsourcing and communally appeal to the government to release the information. The Commission for Racial Equality requires companies to collect this information but does not require them to share it. There is also a black hole of data – illegal immigrants won’t be counted in the Census, for instance – and there is a real opportunity for campaigners to progress this.
Searching for Edinburgh Councillors’ Gifts and Expenses
Edinburgh City Council makes a good scraping target since all the data is locked up in a PDF (and a health warning: James needs to double-check the data, so don’t assume it’s perfect!). The team started by looking at gifts and hospitality – bear in mind that this is self-reported and inconsistent data.
We found some highlights here: Sighthill Community Centre gave the biggest gifts, and Lothian Buses gave out the most, with more than 8 gifts (three of these to one councillor). Councillors’ claims and expenses were also pulled out – we found the highest and, happily, we found councillors with none. The scraper is James/Edinburgh Councillors, but it was all a bit rushed, so don’t blame us if anything is inaccurate!
Judges and Prizing
The judges for the day were Jon Jacob from the BBC College of Journalism, Allan Donald from STV and Huw Owen, Editor of Good Morning Scotland. They picked the Edinburgh Planning Map App as their top project of the day (with Fire Bugs and Magners Cider as runners-up) and Fire Bugs as the most impressive scraper created. Prizes were handed out at the evening reception in the canteen/bar at BBC Scotland (which has a lovely view out onto the Clyde).
Reflections on the Day
The ScraperWiki Hacks and Hackers Day was a really interesting opportunity to look at humanly-accessible data that is not (yet) machine readable and think about what could be done with that data. One of the problems, however, with using scrapers for grabbing data in this way is that it is a workaround solution for inaccessible data where the structure may change without warning and whose owners have no responsibility for maintaining access.
There are huge possibilities for building useful tools for journalists and the wider public with that data but I did leave the day feeling that ScraperWiki is more useful as a campaigning tool to show the value of data than as a stable source of data feeds for production systems. What was built on the day did show what could be done if the data was provided, by the data owners, in discoverable, machine readable and interoperable formats (something the JISC RDTF project is working on) and/or made available via APIs.
The day did also raise some interesting questions about the future of journalism. The ScraperWiki team see data as a core part of journalism and expect journalists who are entering the profession from now onwards to be able to code or feel confident commissioning coding work from others in order to understand this data. I would not disagree with that but see the usefulness of hacking skills to a wider range of professionals. There is also work to do in raising awareness of how to “read” visualisations and data journalism – pretty images can be created from any data so you need to be sure you know what is being presented or compared and what that actually means.
However an interesting day and I felt relatively happy with my choice of “developer” badge – I didn’t do serious coding but I was able to help pick apart the data a bit and as one of very few women (and the only one with a developer badge I believe) I was pleased to fly the flag for keen female geeks everywhere!
:::: Update ::::
After sharing this post on Twitter Aine McGuire replied:
@suchprettyeyes Nicola once a scraper is written (not easy i agree) then the data accumulates automatically
Which has prompted me to add this note as I realise I did not, perhaps, properly explain how scrapers work and why they can be useful in ongoing situations. The idea of ScraperWiki is that once you have created a Scraper you can set it to run automatically on a regular basis – daily, weekly, etc. Effectively you create a script that scrapes web pages and then you run that script via a cron job in the cloud. That means that you can build web apps, tools, etc. that use that updated output data.
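In other words, a scheduled scraper stays useful because each run saves against a unique key, so re-running it updates existing records and accumulates new ones. A minimal sketch of that pattern using plain SQLite (the references and statuses below are invented; ScraperWiki's own save helper behaves along these lines):

```python
import sqlite3

# The unique key (here, the planning reference) is the table's primary
# key, so INSERT OR REPLACE updates a row if it already exists and
# inserts it otherwise.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (ref TEXT PRIMARY KEY, status TEXT)")

def save(rows):
    """Save one scheduled run's worth of scraped rows."""
    db.executemany("INSERT OR REPLACE INTO data VALUES (?, ?)", rows)

# First scheduled run scrapes two applications...
save([("11/00123", "pending"), ("11/00124", "pending")])
# ...the next day's run sees one updated record and one new one.
save([("11/00123", "approved"), ("11/00125", "pending")])

count = db.execute("SELECT COUNT(*) FROM data").fetchone()[0]
status = db.execute(
    "SELECT status FROM data WHERE ref='11/00123'").fetchone()[0]
print(count, status)  # 3 approved
```

After the second run the store holds three records, not four, and the changed record carries its new status – which is exactly what lets apps built on the store stay current without any manual intervention.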
My concern, however, is that data providers don’t necessarily know that they are your data providers in these circumstances. That adds two possible vulnerabilities to tools built on scrapers:
- Firstly, a scraper is trying to build structured data from a webpage, so any kind of change to the design or formatting of that page – a new field in a form, a change of URL, etc. – could break the scraper and require a rewrite. If the people running the website you scrape are not aware of the impact of small changes, you could face a constant need to tweak and rewrite scripts, although my major concern here is that you produce data that is incorrect or less useful before you realise there is a problem. By contrast, those providing APIs do at least know they are a data provider and, whilst that may not make them any more sensitive to the needs of third-party developers, they can ensure that changes are only made when necessary and are more likely to document them accordingly.
- Secondly it is entirely possible, especially when creating your own structure for others’ data, to make small errors in interpretation or labelling of data. If the data provider knows that you are/may be using this data there is a potential for clarification or assistance in interpreting and getting the most accurate results from that data – this is one of the reasons that metadata (often including contact information) is so important on complex data sets for academia (such as those made available in ShareGeo). ScraperWiki allows you to reuse other people’s scrapers and this adds an additional layer of abstraction that could be problematic – although I think sharing of code is very positive as it allows collaborative work and provides really useful reference for those learning or trying to enhance their own code.
So, as before, I think that ScraperWiki is doing a great job as a guerrilla data-accessing tool, but I still think that the aim should be for far more data to be provided directly by the data owners. I am hopeful that the types of PDFs and similarly static documents we encountered at the hack day will, in time, instead appear as APIs, feeds and similar machine-readable formats provided and documented by the data owners themselves.