Sunday, 10 January 2010

How to get BBC Travel updates via RSS using Yahoo Pipes

Here's a bit of a departure from my normal blogging content, sporadic though it is.

I've just been at university, and while I was there I got an email from a colleague asking about good examples of transport content for local government websites. I evidently didn't throw the query out to Twitter particularly well, as the responses I got were about examples of dynamic travel news, such as the Highways Agency Clearspring/GovDelivery widget or Godalming's repurposing of the same content to add geographical proximity.

Yesterday I was looking at how I might get back to York because of the weather. During the afternoon's lecture, cursing my stupidity at not leaving at lunchtime, I visited the BBC and discovered, to my surprise, that their traffic details offer nothing in the way of subscription.

With plenty of time on trains, platforms and coaches to tinker, I thought I'd see if I could do something about that. The terms of the BBC's travel feeds are that they are for personal, non-commercial use, so if you want to be able to get the latest information for yourself, here's how to do it very simply.



Visit this Yahoo Pipe, enter the relevant locality or service and click Run. The wonderful thing about Yahoo Pipes is that it will then give you an RSS feed or, with a quick click of a button, a badge you can put onto your own blog.

But maybe you want to get to grips with what's going on behind the scenes, so here's a quick introduction to the world of Yahoo Pipes.

Now, I love technology and have some basic knowledge of PHP, HTML and CSS that's faded over time, but I've found Pipes to be a brilliant tool for doing a whole host of things. This might not be perfect, but it does work! Obviously the BBC don't want this stuff being used commercially because they pay for it, but if you'd like to build this pipe yourself, here's how I did it.

Step 1
The first thing to do is extract the data from the BBC. The irony is that the BBC actually use RSS to populate the page but don't expose it for syndication. The format of the URL is http://www.bbc.co.uk/travelnews/local/york.shtml, and 'york' is the only part that changes for each locality.

Step 2
So we need to build that URL. To do that I created a new pipe and selected User inputs > Text input. The 'name' field designates what the input will be called; the 'prompt' is the text displayed alongside the empty entry boxes when you run the pipe; the 'position' controls where the input is displayed; the 'default' is what the field contains automatically; and 'debug' is the content the pipe uses in its design state while you're testing it.




Step 3
This provides the area information, in my case york. The BBC URL needs to have .shtml added to the end, so we use String > String Builder (the Highways Agency feed requires .xml instead and follows the same principle). We need to connect the String Builder to the Text input box, and this is where the name 'pipes' comes from: clicking on the circular connector on one module and dragging it to another connects, or wires, them together. Having done that, click the '+' to add another part to the string, the .shtml or .xml.



Step 4
Now we can finish the URL itself. To do that we need URL > URL Builder.

The 'base' is the URL we're tacking the string we've created onto. For us this is:
http://www.bbc.co.uk/travelnews/local and
http://www.highways.gov.uk/rssfeed

The 'path elements' field takes what we've just made, so wire the String Builder into it. In this instance there are no 'query parameters', so just ignore that part.
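If it helps to see what Steps 2 to 4 add up to, here's a rough Python sketch of what the Text input, String Builder and URL Builder do between them: a user-supplied locality or service name gets the right ending stuck on and is then tacked onto the base URL. The function names, and the idea that the Highways Agency feed is given a name in exactly the same way, are my own shorthand rather than anything in Pipes.

# Rough equivalent of Text input + String Builder + URL Builder
def bbc_travel_url(locality):
    # Text input ('york') + '.shtml' + the BBC base URL
    return "http://www.bbc.co.uk/travelnews/local/" + locality + ".shtml"

def highways_feed_url(service):
    # Same idea for the Highways Agency, which wants '.xml' instead
    return "http://www.highways.gov.uk/rssfeed/" + service + ".xml"

print(bbc_travel_url("york"))  # http://www.bbc.co.uk/travelnews/local/york.shtml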



Step 5
Having got the source data URL we need to fetch it. The Highways Agency feed is already in the right format, so we need only use Sources > Fetch Feed and wire the URL Builder into it. For the time being nothing more needs to be done to the Highways Agency feed, so we'll come back to it.
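For anyone who'd rather see that in code, here's a minimal Python sketch of roughly what Fetch Feed does with that URL, using only the standard library. The element names are just the usual RSS ones (title, description, pubDate), not anything special to the Highways Agency.

import urllib.request
import xml.etree.ElementTree as ET

def fetch_feed(url):
    # Fetch an RSS feed and return its items as simple dictionaries
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    items = []
    for item in tree.iterfind(".//item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items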

The BBC content is more complicated. We need to fetch the page using Sources > Fetch Page and then cut the information we want out of it. First of all, wire the URL Builder into the Fetch Page module. Having looked at the source of the page, the information we're interested in sits between two pieces of html, an opening marker just before the table of travel information and a closing marker just after it, so we cut content from one to the other.

Because the page is structured using a table, each individual piece of information sits within a table row, or <tr>, and so </tr> (the end of each table row) is our 'delimiter' (the term for the string that separates one piece of content from the next).
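In code terms, Fetch Page plus the cut and delimiter settings amount to something like this Python sketch. The start and end markers here are placeholders for whatever you find in the page source, not the real values from the pipe.

import urllib.request

def fetch_rows(url, start_marker, end_marker, delimiter="</tr>"):
    # Fetch the page, keep only the chunk between the two markers,
    # then split that chunk into one item per table row
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    start = html.find(start_marker) + len(start_marker)
    end = html.find(end_marker, start)
    table = html[start:end]
    return [row for row in table.split(delimiter) if row.strip()]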



Step 6
This is where things become considerably more complicated, but I'll try to explain it as simply as possible. Add an Operators > Regex module (Regex is short for Regular Expression); this takes a field and, according to the rules you give it, rewrites its contents. The data from the BBC is written to be displayed as a web page, not as a feed, so it contains html and other formatting information. We want to get rid of that.

So, 'item.content' needs tidying up. This set of regex rules removes formatting instructions such as bold, italic and font size. In every case we want to remove all mentions of them, so tick 'g' for 'global matching'.

The other thing we want to remove is the initial code that labels each table row with a unique reference ending name="3469238">. The regex rule here strips everything up to and including the exact combination of "> (the pattern ends with '\">') and so takes that initial code away.

In the image you'll see checkboxes marked g, s, m and i. Most of the g boxes are ticked; this enables global matching, so every instance of a pattern is covered.
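If regex is new to you, this is roughly what the module is doing, written as Python. The patterns below are illustrative stand-ins for the rules in the pipe; the BBC's markup dictates exactly what you'd need to strip.

import re

def tidy_content(content):
    # Strip formatting tags such as bold, italic and font size
    # (re.sub replaces every match, like ticking 'g' in Pipes)
    content = re.sub(r"</?(?:b|i|strong|em|font)[^>]*>", "", content, flags=re.IGNORECASE)
    # Strip the row label at the start: everything up to the first ">
    content = re.sub(r'^.*?">', "", content, flags=re.DOTALL)
    return content.strip()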




Step 7
Now the content needs to be made into a series of separate parts. We do that using Operators > Rename. This copies the content into 'title', 'description' and 'time' fields, duplicating the information so that we can build the final items for our feed.



Step 8
Having created those, some more regex is required to restrict the content of each part. You have to analyse the source to see where the breaks, or changes, need to be made. I was using the most basic regex, '.+': the full stop represents any character and the plus sign means one or more of the preceding character.

I decided to extract the severity image, Road Name and Location as the title. The way the source code was written meant this required removing everything after the first line break (<BR>). The regex to do this is <BR>.+.

The item.description was next up. The relevant data was separated in the code by a line break preceded by a space ( <BR>), so the regex was '.+ <BR>'. The second thing I did here was remove an errant comma.

I also wanted to pull out the time of the update so that the feed can be sorted at the end. This data could be separated from the remainder of the content by a double line break (<BR><BR>). The regex for this was .+<BR><BR>.

On this occasion, the s checkboxes are ticked. These allow the '.' to match across newlines, which is needed with this data because the html source of the BBC page is split across a lot of them.
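Put together, Steps 7 and 8 boil down to copying the tidied content three times and trimming each copy with a different pattern. Here's the same thing as a Python sketch, with re.DOTALL standing in for the 's' checkbox; exactly how the errant comma gets removed is a guess on my part.

import re

def build_item(content):
    # Step 7: copy the tidied row content into three fields
    item = {"title": content, "description": content, "time": content}
    # Step 8: title keeps what comes before the first line break
    item["title"] = re.sub(r"<BR>.+", "", item["title"], flags=re.DOTALL)
    # Description keeps what follows the space-plus-line-break
    item["description"] = re.sub(r".+ <BR>", "", item["description"], flags=re.DOTALL)
    item["description"] = item["description"].replace(",", "", 1)  # the errant comma (my guess at the rule)
    # Time keeps what follows the double line break
    item["time"] = re.sub(r".+<BR><BR>", "", item["time"], flags=re.DOTALL)
    return {key: value.strip() for key, value in item.items()}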



Step 9
As far as I understand RSS, which isn't very far technically, items need a pubDate in order to publish properly and to be sorted. At the moment the feed doesn't have one. However, what was just done to item.time has put it into an acceptable format, so renaming item.time to item.pubDate gives us one and a way of sorting the feed so that the most recent content is seen first.



Step 10
The module Operators > Union brings multiple feeds together, which means wiring both the original Highways Agency Fetch Feed module and the Rename module into it.

Step 11
Operators > Sort takes the combined feed and sorts it, in this instance by item.pubDate and in descending order (most recent first).
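The tail end of the pipe, Steps 9 to 11, is no more than a rename, a join and a sort. In Python it looks something like this, assuming the pubDate values end up in the standard RFC 822 form RSS expects.

from email.utils import parsedate_to_datetime

def combine_and_sort(bbc_items, highways_items):
    # Step 9: rename time to pubDate on the BBC items
    for item in bbc_items:
        item["pubDate"] = item.pop("time")
    # Step 10: union the two feeds
    merged = bbc_items + highways_items
    # Step 11: sort by pubDate, most recent first
    return sorted(merged, key=lambda i: parsedate_to_datetime(i["pubDate"]), reverse=True)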

Step 12
And that's it: you can wire it all together and publish the pipe. Click Save and then run it.



Hope that's useful to someone! While it works, there may well be better ways of doing it, so if you can help me learn what they are I'd be very interested.
