Scrape blogposts' data directly - Part II.

In the previous post, we almost decoded the time from a criticalmass.hu blogpost. We have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")

Now, we will extract the exact time as a POSIX object, then get the author's name as a string. Let's see!

The splitted.source object is a list of one element, which contains a character vector: each element of this vector is a long string. We need the last 16 characters of the first string. We can get them by splitting the whole string into individual characters, taking the last sixteen, and binding them back into one string. To split the string into characters, we can use the strsplit() function again: time <- unlist(strsplit(unlist(splitted.source)[1], split="")). Note the nested unlist() function: it flattens the elements of a list into one vector, if that is possible. Instead of the inner unlist(), we could simply use splitted.source[[1]][1], which gives the same string. I used unlist() after strsplit()-ting the string because in the next step I will need the number of characters, and this way I can manage it in one step.
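To see what the split produces without downloading anything, here is a minimal sketch on a made-up string. The sample text and its date are assumptions for illustration only, not the real page source.

```r
# Hypothetical stand-in for the first chunk of the page source;
# the real chunk is much longer, but it also ends with the time stamp.
chunk <- "<span class=\"submitted\">2009-03-12 14:25"

# strsplit() returns a list, even for a single input string
chars.list <- strsplit(chunk, split = "")
class(chars.list)   # "list"

# unlist() flattens it into a plain character vector, one character per element
chars <- unlist(chars.list)
length(chars) == nchar(chunk)   # TRUE

# Equivalent extraction without unlist()
identical(chars, chars.list[[1]])   # TRUE
```

Either form works; unlist() just saves one level of indexing in the following steps.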
Now we have a bunch of separate characters in a character vector, of which we need the last sixteen: time <- time[(length(time)-15) : length(time)]. The parentheses are very important in this case, because the colon (:) operator binds more tightly than multiplication, addition and subtraction. If we left out the parentheses around (length(time)-15), R would evaluate 15:length(time) first, and we would get a vector of decreasing values, from length(time)-15 to 0. We then have to create one string from the 16 characters, and we are almost done with this. For making the string, we use the paste() function, and especially its collapse argument: time <- paste(time, collapse="").
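The effect of the parentheses is easy to check on a short made-up string (the text and date below are assumptions, chosen so that the last 16 characters form a time stamp). The sketch also shows substr() with nchar() as a possible one-step alternative to splitting and pasting.

```r
# Hypothetical string ending in a 16-character time stamp
s <- "posted on 2009-03-12 14:25"
chars <- unlist(strsplit(s, split = ""))
n <- length(chars)   # 26

# With parentheses: the indices n-15, ..., n (sixteen of them)
idx <- (n - 15):n
length(idx)   # 16

# Without parentheses, 15:n is evaluated first and then subtracted from n,
# which gives the decreasing sequence n-15, ..., 0
wrong <- n - 15:n
wrong[1] == n - 15          # TRUE
wrong[length(wrong)] == 0   # TRUE

# Bind the sixteen characters back into one string
last16 <- paste(chars[idx], collapse = "")
last16   # "2009-03-12 14:25"

# The same result without splitting, using substr() and nchar()
identical(last16, substr(s, nchar(s) - 15, nchar(s)))   # TRUE
```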
R can handle dates and times, and can convert strings to time objects with the as.POSIXlt() or the as.POSIXct() function. For more information, see ?as.POSIXlt.
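As a rough sketch of the conversion, assume the extracted string looks like "YYYY-MM-DD HH:MM". That format, the sample value, and the UTC time zone are assumptions; adjust the format string to whatever the page actually contains.

```r
# Assumed time stamp format: "YYYY-MM-DD HH:MM" (16 characters)
time <- "2009-03-12 14:25"

# tz is set explicitly so the result does not depend on the machine's settings;
# UTC is used here only for illustration
time.ct <- as.POSIXct(time, format = "%Y-%m-%d %H:%M", tz = "UTC")
time.lt <- as.POSIXlt(time, format = "%Y-%m-%d %H:%M", tz = "UTC")

format(time.ct, "%Y-%m-%d %H:%M")   # back to text
time.lt$hour                         # 14; POSIXlt exposes the components
time.lt$year + 1900                  # 2009; years are stored as offsets from 1900

# A string that does not match the format yields NA rather than an error
is.na(as.POSIXct("not a date", format = "%Y-%m-%d %H:%M", tz = "UTC"))   # TRUE
```

as.POSIXct() stores the time as a single number, which is convenient in data frames, while as.POSIXlt() keeps the components in a list.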

How to get the user?

It is almost as easy as getting the time. From the previous post, we know that we need the string immediately in front of the string "</a></span>". If we split the second element of splitted.source at that string, the first part will store the username. So, the command looks like author <- strsplit(splitted.source[[1]][2], split="</a></span>"). We now have "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">erhardt.gergo_szeged" in author[[1]][1]. We only have to use strsplit() again and keep the second part: author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2].
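The two splits can be tried offline on the chunk quoted above. In this sketch the text after "</a></span>" is made up, and fixed = TRUE is an addition that is not in the original commands: it makes strsplit() treat the separator as literal text, so the "." is not read as a regular-expression wildcard.

```r
# The chunk quoted above, followed by a made-up remainder of the page
second.chunk <- paste0(
  "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">",
  "erhardt.gergo_szeged</a></span> rest of the page"
)

# Step 1: keep what comes before "</a></span>"
author <- strsplit(second.chunk, split = "</a></span>", fixed = TRUE)[[1]][1]

# Step 2: keep what comes after the end of the title attribute
author <- strsplit(author, split = "megtekintése.\">", fixed = TRUE)[[1]][2]
author   # "erhardt.gergo_szeged"
```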
This way we have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- time[(length(time)-15) : length(time)]
time <- paste(time, collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]

Let's create a function!

The function needs only one input variable (the web address of the blogpost), and it has to return both values. As the time variable is still a character string, you can return them together in a character vector:
getTimeAuthor <- function(blog.address){
  # Download the page (requires library(RCurl)) and split it where the author link starts
  page.source <- getURL(blog.address)
  splitted.source <- strsplit(page.source, split=" — <a href")
  # The time is the last 16 characters of the first part
  time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
  time <- time[(length(time)-15) : length(time)]
  time <- paste(time, collapse="")
  # The author stands between the end of the title attribute and "</a></span>"
  author <- strsplit(splitted.source[[1]][2], split="</a></span>")
  author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
  return(c(author,time))
}
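Because the function above downloads the page, it cannot be tried without a network connection. One possible way around this is sketched below: parseTimeAuthor() is a hypothetical helper, not part of the original code, that performs the same steps on a page source passed in as a string. The page fragment is made up to mimic the structure described in the post.

```r
# Hypothetical helper: same steps as getTimeAuthor(), but it takes the page
# source as a string instead of downloading it
parseTimeAuthor <- function(page.source) {
  chunks <- strsplit(page.source, split = " — <a href", fixed = TRUE)[[1]]
  # The time is the last 16 characters of the first chunk
  time <- substr(chunks[1], nchar(chunks[1]) - 15, nchar(chunks[1]))
  # The author stands between the end of the title attribute and "</a></span>"
  author <- strsplit(chunks[2], split = "</a></span>", fixed = TRUE)[[1]][1]
  author <- strsplit(author, split = "megtekintése.\">", fixed = TRUE)[[1]][2]
  c(author = author, time = time)
}

# Made-up page fragment with an assumed time stamp format
fake.page <- paste0(
  "<span class=\"submitted\">2009-03-12 14:25 — <a href",
  "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">",
  "erhardt.gergo_szeged</a></span><p>Post body</p>"
)

result <- parseTimeAuthor(fake.page)
result[["author"]]   # "erhardt.gergo_szeged"
result[["time"]]     # "2009-03-12 14:25"
```

Naming the elements of the returned vector means the caller does not have to remember which position holds the author and which the time.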
