
Scrape blogposts' data directly - Part III.

Okay, I just found R-Fiddle and I want to try it NOW, so I will show you how to finish the automated data collection from criticalmass.hu.

We know how to get the data from one node. Now we need to find all the nodes - and that is not so hard, because criticalmass.hu is almost pure Drupal with only minor changes, so the Drupal documentation (or just plain logic) can help us: any post can be accessed at http://site-name/node/node-number. Yup, that easy. So all we have to do is generate the web addresses automatically, run the previous function on each of them, and save the results.
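Generating the addresses is a one-liner in R; a minimal sketch (the upper bound of 10 is arbitrary here):

```r
# Build the addresses for the first 10 nodes (10 is an arbitrary example bound)
addresses <- paste0("http://criticalmass.hu/node/", 1:10)
addresses[3]  # "http://criticalmass.hu/node/3"
```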

I wrapped the two steps around a single data.frame: for data storage, I created a data.frame with three variables: id, author and time. The id is the node number, which is also the row index of the data.frame. For this we must know in advance how many posts we want to search, but today's internet connections allow us to overestimate.
The code looks like data.container <- data.frame(id=1:nodes, author=NA, time=NA), where nodes is a given positive integer. After this, we only need to iterate over the rows and wait. Some strange things will happen when a page is unavailable or not found, but those rows can be filtered out later.

This approach can easily fail if the getURL() function does not get a proper answer, in which case it throws an error that stops the whole loop. This is why it is safer to write the data.frame into a csv file and read it back at a restart. The full code can be seen below:
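A minimal sketch of such a restartable loop, assuming getTimeAuthor() from the previous part (and the RCurl package it uses) is available; the function name collectNodes and the csv file name are my own choices, not from the original post:

```r
# Sketch: iterate over all nodes, tolerate failed requests, and persist
# progress to a csv file so the run can be restarted where it stopped.
# Assumes getTimeAuthor() from Part II is already defined.
collectNodes <- function(nodes, csv.file = "criticalmass_nodes.csv") {
  if (file.exists(csv.file)) {
    # restart: pick up the previously saved results
    data.container <- read.csv(csv.file, stringsAsFactors = FALSE)
  } else {
    data.container <- data.frame(id = 1:nodes, author = NA, time = NA)
  }
  for (i in 1:nodes) {
    if (!is.na(data.container$author[i])) next  # already fetched
    result <- tryCatch(
      getTimeAuthor(paste0("http://criticalmass.hu/node/", i)),
      error = function(e) c(NA, NA))  # missing page: leave NAs, filter later
    data.container[i, c("author", "time")] <- result
    # write after every node: slow, but nothing is lost on a crash
    write.csv(data.container, csv.file, row.names = FALSE)
  }
  data.container
}

# data.container <- collectNodes(2000)  # would fetch nodes 1..2000
```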

Scrape blogposts' data directly - Part II.

In the previous post, we almost decoded the time from a criticalmass.hu blogpost. We have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")

Now, we will extract the exact time as a POSIX object, then get the author's name as a string. Let's see!

Our splitted.source object is a one-element list containing a character vector: each element of this vector is a long string. We need the last 16 characters of the first string, which we can reach by splitting the whole string into individual characters, taking the last sixteen and binding them back into one string. To split the string into characters, we can use the strsplit() function again: time <- unlist(strsplit(unlist(splitted.source)[1], split="")). Note the nested unlist() calls: unlist() binds every element of a list into one vector, if possible. Instead of the inner unlist(splitted.source)[1], we could simply use splitted.source[[1]][1], which is basically the same. I used unlist() after strsplit()-ting the string because in the next step I will need the number of characters, and this way I can manage it in one step.
Now we have a bunch of separate characters in a vector, of which we need the last sixteen: time <- time[(length(time)-15) : length(time)]. The brackets are very important here, because the colon (:) operator binds more tightly than addition and subtraction. If we left the brackets off (length(time)-15), we would get length(time) - (15:length(time)), a vector of decreasing values from length(time)-15 down to 0. Finally we have to collapse the 16 characters into one string, and we are almost done. For this we use the paste() function, and especially its collapse argument: time <- paste(time, collapse="").
R can handle time codes, and can transform strings to time with the as.POSIXlt() or with the as.POSIXct() function. For more information, see ?as.POSIXlt.
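For example, assuming the extracted string looks like the one in this post, the conversion could be:

```r
# Convert the 16-character string into a proper time object
time <- "2006-04-16 22:47"
time.posix <- as.POSIXct(time, format = "%Y-%m-%d %H:%M", tz = "UTC")
format(time.posix, "%Y-%m-%d")  # "2006-04-16"
```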

How to get the user?

It is almost as easy as getting the time. From the previous post, we know that we need the string immediately in front of "</a></span>". If we split splitted.source's second element there, its first part will contain the username. So the command looks like author <- strsplit(splitted.source[[1]][2], split="</a></span>"). Now author[[1]][1] holds "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">erhardt.gergo_szeged". We only have to use strsplit() once more: author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2].
This way we have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- time[(length(time)-15) : length(time)]
time <- paste(time, collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
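The whole chain can be checked offline on a hardcoded copy of the snippet from the previous post (no network needed; the snippet is pasted in as a string here):

```r
# The <span class="submitted"> snippet from the page source, hardcoded for testing
snippet <- '<span class="submitted">v, 2006-04-16 22:47 — <a href="/tagok/erhardt-gergo-szeged" title="Felhasználói profil megtekintése.">erhardt.gergo_szeged</a></span>'
splitted.source <- strsplit(snippet, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- paste(time[(length(time)-15):length(time)], collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
time    # "2006-04-16 22:47"
author  # "erhardt.gergo_szeged"
```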

Let's create a function!

The function will need only one input variable (the web address of the blogpost), and it has to return both values. As the time variable is a character string, you can return them together in a vector:
getTimeAuthor <- function(blog.address){
  page.source <- getURL(blog.address)
  splitted.source <- strsplit(page.source, split=" — <a href")
  time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
  time <- time[(length(time)-15) : length(time)]
  time <- paste(time, collapse="")
  author <- strsplit(splitted.source[[1]][2], split="</a></span>")
  author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
  return(c(author, time))
}

Scrape blogposts' data directly - Part I.

Previously I showed a way to get geocoordinates from Google with R. Although it could be better, it will do for a one-time trial. This time, however, I will show something a little bit more exciting - at least more exciting for me.

I needed some data from the site of the Hungarian Critical Mass movement. This page and the movement is (or was) a so-called grassroots movement: the users are the authors of the site. Here emerges the shadow of the last months' NSA scandal: this site is open to the public, and it publishes information about the users who registered and were brave enough to write a post to the audience. How could we track when, and how often, a user was this brave?

First of all, you should be familiar with what you are looking for. Let's look at this post. We want to get its time of publishing and its author. If you can see it, you are good to go: the info is under the title "Szegednek, segítsetek!": "2006-04-16 22:47 — erhardt.gergo_szeged". Now we know what we are looking for; how could we get it? You can access the source code of the page you are viewing; in a web browser, the right-click menu usually offers some option about viewing the source. Then you just have to search for the desired text:
"<span class="submitted">v, 2006-04-16 22:47 — <a href="/tagok/erhardt-gergo-szeged" title="Felhasználói profil megtekintése.">erhardt.gergo_szeged</a></span>"
We would like to get the part after "submitted" and the part before the "</a></span>" at the end.

Second: you need a device. In my case this is R, which is quite good but not perfect - and this imperfection can be healed by installing the RCurl package. Once it is loaded, you can use the getURL() function. It returns the server's answer to the given address. I say answer, because it is not always an HTML page's source code; it can be anything, and in the previous post it was a JSON object. This time it will be the source code we want - just try running the command getURL("http://criticalmass.hu/node/500"). You will see a bunch of characters. You should probably assign it to a variable, for example: page.source <- getURL("http://criticalmass.hu/node/500").

How can you slice this pile of characters? R has the strsplit() function, which cuts every string in a vector by the given split parameter. First, I will show how to extract the date of the post:
Luckily, the date is always in the same format and takes up 16 characters. I can cut the string right after the date and extract the last 16 characters of what comes before. The string to cut at is " — <a href", with the command time <- strsplit(page.source, split=" — <a href"). This gives us a list with one element, which is a character vector of length two. It is a list because strsplit() expects a vector of strings, and each split element gets its own list element. Our vector contains only one string, so strsplit() returns a list with one element; in that element we have the split string as a two-element vector. We will need the last 16 characters of its first element.
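The list-of-vectors behaviour of strsplit() is easy to see on a tiny made-up example:

```r
# strsplit() returns a list: one list element per input string,
# each holding the pieces of that string as a character vector
parts <- strsplit("one-two", split="-")
parts[[1]]     # the character vector c("one", "two")
parts[[1]][1]  # "one"
```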

The code looks like this now:
page.source <- getURL("http://criticalmass.hu/node/500")
time <- strsplit(page.source, split=" — <a href")

In the next part I will show how to finish this function.
