Scraping blog posts' data directly - Part I.

Previously I showed a way to get geocoordinates from Google with R. Although it could be better, that will do for a one-time trial. This time, however, I will show something a little more exciting - at least more exciting for me.

I needed some data from the site of the Hungarian Critical Mass movement. The movement is (or was) a so-called grassroots movement, and the users are the authors of the site. Here emerges the shadow of the last months' NSA scandal: this site is open to the public, and it publishes information about the users who registered and were brave enough to write a post for the audience. How could we track when, and how often, a user was this brave?

First of all, you should be familiar with what you are looking for. Let's look at this post. We want to get its time of publishing and its author. If you can see them, you are good to go: the info is under the title "Szegednek, segítsetek!": "2006-04-16 22:47 — erhardt.gergo_szeged". Now that we know what we are looking for, how can we get it? You can view the source code of the page you are looking at: in most web browsers, a right click gives you an option to view the source. Then you just have to search for the desired text:
"<span class="submitted">v, 2006-04-16 22:47 — <a href="/tagok/erhardt-gergo-szeged" title="Felhasználói profil megtekintése.">erhardt.gergo_szeged</a></span>" We would like to get the part after "submitted" and before the closing "</a></span>".

Second: you need a tool. In my case, this is R, which is quite good but not perfect - and this imperfection can be healed by installing the RCurl package. Once it is loaded, you can use the getURL() function. It returns the server's response for the given address. I say response because it is not always the source code of an HTML page; it can be anything, and in the previous post it was a JSON object. This time it will be the source code we want: just try running the command getURL("http://criticalmass.hu/node/500"). You will see a long string of characters. You should probably assign it to a variable, for example: page.source <- getURL("http://criticalmass.hu/node/500").
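Before slicing, it may be worth checking that the returned text really contains the line we saw in the browser. Here is a minimal sketch in base R. To keep it runnable without a network connection, page.source is a hard-coded, shortened stand-in for the real response (I left out the title attribute), and the names marker, found and position are just my own choices for this illustration.

```r
# Shortened stand-in for the server's response (assumption: the real page
# contains this span somewhere in its source). "\u2014" is the long dash.
page.source <- paste0(
  "<span class=\"submitted\">v, 2006-04-16 22:47 \u2014 ",
  "<a href=\"/tagok/erhardt-gergo-szeged\">erhardt.gergo_szeged</a></span>"
)

# The text we searched for by hand in the browser's source view
marker <- "class=\"submitted\""

# fixed = TRUE searches for the literal text instead of a regular expression
found <- grepl(marker, page.source, fixed = TRUE)

# Character position where the marker starts (-1 if it is missing)
position <- regexpr(marker, page.source, fixed = TRUE)

print(found)
print(as.integer(position))
```

If found is FALSE on a real page, the layout probably differs from this post's, and the splitting approach would need another look.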

How can you slice this pile of characters? R has the strsplit() function, which cuts every string in a vector at the given split parameter. First, I will show how to extract the date of the post:
Luckily, the date is always in the same format and takes up 16 characters. I can cut the string immediately after the date and then take the last 16 characters of what comes before the cut. The string to cut at is " — <a href", with the command time <- strsplit(page.source, split=" — <a href"). This will give us a list with one element, which is a character vector of length two. The result is a list because strsplit() expects a vector of strings, and the pieces of each input string get their own list element. Our vector contains only one string, so strsplit() returns a list with one element. In that element, we have the split string as a two-element vector (provided the split text occurs exactly once in the page). We need the last 16 characters of the first element.
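To make the list-in-a-list structure easier to see, here is a small sketch that runs on a hard-coded, shortened copy of the line from the page source instead of the downloaded page, so it needs neither RCurl nor a connection. The names pieces, before and post.time are my own for this illustration, and taking the last 16 characters with nchar() and substr() is just one possible way to do that step.

```r
# Shortened stand-in for the downloaded source; "\u2014" is the long dash
page.source <- paste0(
  "<span class=\"submitted\">v, 2006-04-16 22:47 \u2014 ",
  "<a href=\"/tagok/erhardt-gergo-szeged\">erhardt.gergo_szeged</a></span>"
)

# One input string in, so a list with one element comes back;
# that element holds the two pieces around the split text
pieces <- strsplit(page.source, split = " \u2014 <a href", fixed = TRUE)
str(pieces)

# Everything before the split text; the date sits at its very end
before <- pieces[[1]][1]

# Keep the last 16 characters, i.e. "YYYY-MM-DD HH:MM"
n <- nchar(before)
post.time <- substr(before, n - 15, n)
print(post.time)
```

I used fixed = TRUE here so that the split text is taken literally; without it strsplit() treats the split as a regular expression, which happens to give the same result for this particular string.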

The code looks like this now:
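library(RCurl)  # provides getURL()
page.source <- getURL("http://criticalmass.hu/node/500")
# Split right after the date: the first piece ends with the 16-character date
time <- strsplit(page.source, split=" — <a href")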

In the next part I will show how to finish this function.
