
Where's the script?

Although I wrote some automatically running R scripts, setting the working directory always caused me a headache: unless I called Rscript from the right directory (which I couldn't do with cron), every script had to start with
setwd("folder/containing/the/script")
This was fine until I had to share the code with other participants, because 1. it wouldn't run on their computers, and 2. it exposed the path on my computer, which I found disturbing. So, I came up with a solution.

The commandArgs() function returns all the arguments of the running Rscript. Among them is the currently used .R file: "--file=running_script.R". I just had to get rid of the --file= part and pass the result to dirname() to get the answer to the original question. Here's the whole function as a gist:
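A minimal sketch of what such a helper could look like, assuming the script is started with Rscript (the actual gist may differ in the details):

get_script_dir <- function(){
    # pick the "--file=..." argument, strip the prefix, then take its directory
    args <- commandArgs(trailingOnly = FALSE)
    file.arg <- grep("^--file=", args, value = TRUE)
    dirname(sub("^--file=", "", file.arg))
}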

Now I start every script with
setwd(get_script_dir())

Find all duplicated values in R

In the past few weeks I had some spare time with only a few tasks to do, so I tried to optimize some of my functions.
For example, I needed a function that tells me which rows of a data.frame are duplicated. duplicated() is a really fast tool for this, but I needed something more: I had to know not only the repeating instances of the data, but the first ones, too. This is useful if you have a database loaded as a data.frame and you want to know where the items identified by a key differ.
The basic idea is that you search for the repeating instances of the keys, then compare them to the whole original data.frame. This can be pretty slow: after getting the indices which are duplicated, you select the given rows and columns of the data.frame, but the worst part is the comparing. You cannot just use %in%, because it does not work row-wise on a data.frame. After I realized this, I looked inside duplicated() to see what is in it:

> duplicated
function (x, incomparables = FALSE, ...)
UseMethod("duplicated")

Well, it did not help too much, but I learned that this is a generic method, which means that when I call the function, it will check its argument's class and try to call the appropriate method for it; in this case, duplicated.data.frame(). So, I looked into that one:

> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...)
{
    if (!identical(incomparables, FALSE))
        .NotYetUsed("incomparables != FALSE")
    if (length(x) != 1L)
        duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    else duplicated(x[[1L]], fromLast = fromLast, ...)
}

That's the one. In the first if, it calls .NotYetUsed(), which was unknown to me, but one can see that it only produces a message (I don't know yet why and how this works here):

> .NotYetUsed
function (arg, error = TRUE)
{
    msg <- gettextf("argument '%s' is not used (yet)", arg)
    if (error)
        stop(msg, domain = NA, call. = FALSE)
    else warning(msg, domain = NA, call. = FALSE)
}

After the first if clause did not give me much of a hint, the second became my true guide. It says that if the number of columns of the data.frame (because length.data.frame gives us that number) is not one (1 = 1L, for more information check this), paste the columns together and search for duplicates among those; the result is a character vector, so duplicated.character will work on it. If the data.frame consists of only one column, then extract it with "[[", which gives us a vector too, and the appropriate duplicated() method can be used.
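To see what that pasting step does, here is a toy example (the data.frame is made up for illustration):

df <- data.frame(a = c(1, 1, 2), b = c("x", "x", "y"))
do.call("paste", c(df, sep = "\r"))              # "1\rx" "1\rx" "2\ry"
duplicated(do.call("paste", c(df, sep = "\r")))  # FALSE  TRUE  FALSE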

Note the trick at the pasting step: it uses do.call(). I couldn't decode it fully, but somewhere in the StackOverflow - R help - R-bloggers triangle I read that it creates one function call from its arguments, while the *apply family creates a function call for each item of its arguments. Somewhere here is the witchery. After collecting all this information, I decided to follow the example of duplicated.data.frame(): search for the duplicated values in the pasted version, then find all their occurrences in the pasted version with the %in% operator (a.k.a. the match() function). Here is the code, which can be found on GitHub:

allDuplicated <- function(x){
    # Paste the columns of x together, then flag every row whose pasted value
    # occurs more than once. Unlike duplicated(), this returns TRUE for the
    # first occurrence, too.
    x.pasted <- do.call("paste0", c(x))
    d <- x.pasted[duplicated(x.pasted)]
    return(x.pasted %in% d)
}
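Used on the toy data.frame from above, it flags both repeated rows, not only the second one:

allDuplicated(df)
# [1]  TRUE  TRUE FALSE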

As usual, the code is available on GitHub.

I also did a little benchmarking: I used a data.frame with 2 columns and 25492 rows, which had only 81 unique values. Repeating the duplicated() function 100 times with a for() loop took 8.03 sec, while my allDuplicated() needed 8.83 sec, so it was about a tenth slower. I also measured a 1000-times repetition: the overall time duplicated() needed was 83.19 sec, while allDuplicated() ran in only 81.92 sec. I don't know why; maybe because I looked at a notepad window while duplicated() worked...

I tried it without disturbance, on a generated non-repeating dataset of the same dimensions. This time allDuplicated() needed less time to run 1000 times, only 103.63 sec compared to the 104.36 sec of the duplicated() function. Explanations are welcome!

Scrape blogposts' data directly - Part III.

Okay, I just found R-Fiddle and I want to try it NOW, so I will show you how to finish the automated data collection from criticalmass.hu.

We know how to get data from one node. We need to know how to find all the nodes - and that is not so hard, as criticalmass.hu is almost purely Drupal, with only a little change in it, so the Drupal documentation can help us, or just plain logic: we can access any post at http://site-name/node/node-number. Yup, it's that easy. So all we have to do is automatically create the web addresses, run the previous function on them, then save the results.

I wrapped the two steps around one data.frame: for data storage, I created a data.frame with three variables: id, author and time. The id is the node-number, and also the index of the rows in the data.frame. For this, we must know how many posts we want to search, but today's internet connections allow us to overestimate.
The code is like: data.container <- data.frame(id=1:nodes, author=NA, time=NA), where nodes is a given positive integer. After this, we only need to iterate over each row and wait. There will be some strange entries where pages are not available or not found, but those can be filtered out later.

This function can easily fail if the getURL() function does not get a proper answer, in which case it throws an error which will cause the whole function to stop. This is why it is safer to write the data.frame into a csv file, and read it back at a restart. The full code can be seen below:
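A minimal sketch of such a collection loop, assuming the getTimeAuthor() function from Part II; the node count, file name and error handling are my own assumptions:

library(RCurl)

nodes <- 2000   # overestimate of the number of posts
data.container <- data.frame(id = 1:nodes, author = NA, time = NA)

for (i in data.container$id) {
    # a failed getURL() (missing page, timeout) just leaves NAs to filter out later
    result <- tryCatch(getTimeAuthor(paste0("http://criticalmass.hu/node/", i)),
                       error = function(e) c(NA, NA))
    data.container[i, c("author", "time")] <- result
    # save regularly, so a crash does not lose everything collected so far
    if (i %% 100 == 0) write.csv(data.container, "criticalmass_data.csv", row.names = FALSE)
}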

Scrape blogposts' data directly - Part II.

In the previous post, we almost decoded the time from a criticalmass.hu blogpost. We have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")

Now, we will extract the exact time as a POSIX object, then get the author's name as a string. Let's see!

Our splitted.source object is a list of one element which contains a character vector: each element of this vector is a long string. We need the first string's last 16 characters, which can be reached by splitting the whole string into individual characters, then taking the last sixteen and binding them into one. To split the string into characters, we can use the strsplit() function again: time <- unlist(strsplit(unlist(splitted.source)[1], split="")). Note the nested unlist() function: it binds every element of a list into one vector, if possible. Instead of this function, we could simply use splitted.source[[1]][1], which is basically the same. I used unlist() after strsplit()-ting the string, because in the next step I will need the number of characters, and this way I can manage it in one step.
Now we have a bunch of separate characters in one long vector, of which we need the last sixteen: time <- time[(length(time)-15) : length(time)]. Brackets are very important in this case, because the colon (:) operator binds tighter than multiplication, and also than addition and subtraction. If we left those brackets off (length(time)-15), we would get a vector with decreasing values, from length(time)-15 down to 0. We then have to create one string from the 16 characters, and we are almost done. For making the string, we use the paste() function, and especially its collapse argument: time <- paste(time, collapse="").
R can handle time codes, and can transform strings to time with the as.POSIXlt() or with the as.POSIXct() function. For more information, see ?as.POSIXlt.
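For example (the format string here is my guess based on how the scraped date looks):

time <- as.POSIXct(time, format = "%Y-%m-%d %H:%M")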

How to get the user?

It is almost as easy as getting the time. From the previous post, we know that we need the string immediately in front of the string "</a></span>". And if we split splitted.source's second element, the first part of it will store the username. So, the command looks like author <- strsplit(splitted.source[[1]][2], split="</a></span>"). We now have "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">erhardt.gergo_szeged" in author[[1]][1]. Now we only have to use strsplit() again: author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2].
This way we have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- time[(length(time)-15) : length(time)]
time <- paste(time, collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]

Let's create a function!

You will need only one input variable (the web address of the blogpost), and you have to return the values. As the time variable is a character string, you can return them both in a character vector:
getTimeAuthor <- function(blog.address){
    page.source <- getURL(blog.address)
    splitted.source <- strsplit(page.source, split=" — <a href")
    time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
    time <- time[(length(time)-15) : length(time)]
    time <- paste(time, collapse="")
    author <- strsplit(splitted.source[[1]][2], split="</a></span>")
    author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
    return(c(author, time))
}
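Hypothetical usage, with the output I would expect based on the post from Part I:

getTimeAuthor("http://criticalmass.hu/node/500")
# [1] "erhardt.gergo_szeged" "2006-04-16 22:47"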

Scrape blogposts' data directly - Part I.

Previously I showed a way to get geocoordinates from Google with R. Although it could be better, that will do for a one-time trial. This time, however, I will show something a little bit more exciting - at least more exciting for me.

I needed some data from the site of the Hungarian Critical Mass movement. This page and the movement is (or was) a so-called grassroots movement: the users are the authors of the site. Here emerges the shadow of the last months' NSA scandal: this site is open to the public, and it publishes information about the users who registered to use it and were brave enough to write a post to the audience. How could we track when, and how often, a user was this brave?

First of all, you should be familiar with what you are looking for. Let's look at this post. We want to get its time of publishing, and the author. If you can see it, you are good to go: the info is under the title "Szegednek, segítsetek!": "2006-04-16 22:47 — erhardt.gergo_szeged". Now we know what we are looking for; how could we get it? One can access the source code of the page one is actually viewing, usually by right-clicking in the web browser and choosing some option about the source. Then you just have to search for the desired text:
"<span class="submitted">v, 2006-04-16 22:47 — <a href="/tagok/erhardt-gergo-szeged" title="Felhasználói profil megtekintése.">erhardt.gergo_szeged</a></span> " We would like to get the part after "submitted" and the part before the "</a></span>" at the end.

Second: you need a device. In my case, this is R, which is quite good, but not perfect - but this imperfection can be healed by installing the RCurl package. Once it is loaded, you can use the getURL() function. It returns the answer of the server to the given address. I say answer, because it is not always an HTML site's source code; it can be anything, and in the previous post it was a JSON object. This time it will be the source code we want; just try running the command getURL("http://criticalmass.hu/node/500"). You will see a bunch of characters. Maybe you should assign them to a variable name, for example: page.source <- getURL("http://criticalmass.hu/node/500").

How can you slice this pile of characters? R has the strsplit() function, which cuts every string in a vector by the given split parameter. First, I will show how to extract the date of the post:
Luckily, the date is always in the same format, and takes up 16 characters. I can chop the string right after the date and extract the last 16 characters. The string to chop at is " — <a href", with the command time <- strsplit(page.source, split=" — <a href"). This will give us a list with one element, which is a character vector of length two. It is a list, because strsplit() expects a vector of strings, and each split element gets its own list element. We have a vector which contains only one element, so strsplit() will return a list with one element. In that element, we will have the split string as a two-element vector. We will need the last 16 characters of the first element.
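A toy example of this list structure (the strings here are made up):

strsplit("first part---second part", split="---")
# [[1]]
# [1] "first part"  "second part"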

The code looks like this now:
page.source <- getURL("http://criticalmass.hu/node/500")
time <- strsplit(page.source, split=" — <a href")

In the next part I will show how to finish this function.

Get address coordinates with R

I wrote a function in R which searches for an address given to it as a string on Google Maps, and returns its latitude and longitude.
Here's the code, which can be found on GitHub too.

IMPORTANT: It needs the rjson package (maybe RJSONIO is ok, too).
library(rjson)

getAddressCoord <- function(address){
    address <- as.character(address)
    # replace spaces with "+" so the address can go into the request URL
    address <- paste(unlist(strsplit(address, split=" ")), collapse="+")
    base <- "http://maps.googleapis.com/maps/api/geocode/json?address="
    ending <- "&sensor=false"
    download.file(url=paste(base, address, ending, sep=""),
                  destfile="temp_address_info.json")
    address.info <- fromJSON(file="temp_address_info.json")
    unlink("temp_address_info.json")
    if(address.info[["status"]] == "OK"){
        long.lat <- unlist(address.info$results[[1]]$geometry$location)
        return(long.lat)
    } else {
        return("Result is not appropriate!")
    }
}
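Hypothetical usage (the address is just an example):

getAddressCoord("1600 Amphitheatre Parkway, Mountain View")
# returns a named numeric vector with the lat and lng elements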

First steps in Python

I'm a regular R user. I love R, because it is closer to a programming language than MS Excel, which I had to use before, and closer than SPSS, too. In the latter case this only means that R is much more consistent than SPSS Syntax.

I had heard the name of Python a lot, but never tried it, because there were only complicated tutorials. I'm not a programmer; I know nothing about computers. I use Drupal because it is simple; it is not likely that I will write new modules for it. I can use R, and what's more, about a week ago I wrote my first S3 object, and maybe I could create a package, but it is still not likely. So then, eventually, how did I start using Python?

A few weeks ago I became a trainee in a firm, and I was told I should learn Python, because I will have to understand it. To do this, I got assistance: they helped me install and set up the necessary things; obviously, at this moment I wouldn't be able to repeat the process. But finally, I have an interactive interface for Python, and a useful book to learn it from. I read the first 3-4 chapters and I found these:
- indentation is important, you cannot use braces for blocks
- the primary prompt contains a space: ">>> "
- if you think about R, every operation there is a function. Python has two different "function" types: some of them don't need parentheses around their arguments, others do.
- there is only one operator for assignment, and it is "="
- you cannot recall the last line with the up arrow, although I didn't try page up and page down.

It is interesting for sure, and looks closer to the hardware than R, hence it must be faster; I'm curious.
