Where's the script?

Althogh I wrote some automatically running R scripts, setting working directories always caused me a headache; unless I called Rscript from the right directory (which I coundn't do with cron), all script started with
setwd("folder/containing/the/script")
This was fine untill I had to share the code with other participants, as 1.: it wouldn't run on their computer, and 2.: it exposed the path on my computer, which I found disturbing. So, I came up with a solution.

The commandArgs function returns all the arguments of the running Rscript. Among them, there is the currently used .R file: "--file=running_script.R". I just had to get rid of the --file= part and give the result to dirname to get the answer to the original question. Here's the whole function as a gist:

Now I start every script with
setwd(get_script_dir())

Messing with sounds

I want to create a simple tool for generating sound, altered on user input. I figured out that I can generate wave with numpy, I can get user input with pygame and use pygame.sndarray and pygame.mixer to make the noise. Lets see part one, that

We are the noise generation!

I found a nice tutorial about generating sine wave with numpy, and I used that as a starting point. I defined a frequency and a sample rate, but instead of sine, I tried to make a simple saw, like this:
[0, 0.25, 0.5, 0.75, 1, 0, 0.25, 0.5, 0.75, 1, ...]

I used the linspace function from numpy to create one edge of the saw, which is as long as the frequencieth part of the sample:
fs = 44100
freq=440.
np.linspace(-1,1,fs/freq)

After reading the pygame documetation, I realized I should initialize the mixer module first, and check its properties:
import pygame
import pygame.mixer as mixer
import pygame.snarray as snd

mixer.init()
mixer.get_init()
# (22050, -16, 2)
So I needed to create an array with two coloumns. How?

First, I had to find a way to create two parallel saw edge. I achieved this with the tile and reshape funcion: a made two edges in line, thed cut it half.
import numpy as np
fs = mixer.get_init()[0]
chs = mixer.get_init()[2]
tst_sound = np.tile(np.linspace(-1,1,fs/freq), chs).reshape(fs/freq, chs)

I had already one short instance of sound, I only needed to repeat it freq times to make it last 1 sec. The problem is that the tile function repeats the number of coloumns, which is understandable, and made me found the transpose function. I swapped the rows and coloumns, repeated the new columns, and swapped them back.
tst_second = np.transpose(np.tile(np.transpose(tst_sound), freq))
tst_snd = snd.make_sound(tst_second)

If the shape of he sound is OK, the make_sound function should not return any problem. After I sorted out coloumn issue, it worked for me. Then I only needed to play it with the mixer module:
tst_snd.play()

Hell no! It's not the sound I want. It turns out, the mixer.get_init() returns three values with a purpose: it expects integers with the given numer of bits. So the type in which we want the array is:
new_type = "int"+str(abs(mixer.get_init()[1]))

To convert between types, there is the astype method:
tst_snd = tst_snd.astype(new_type)

Improvement

Getting more into documenttion, I found out that last argument of tile() accepted more than one dimension, so the code looks like this:
tst_sound = np.tile(np.linspace(-1.,1.,fs/frequency), (chs, 1))

Also, instead of transposing, I can use the axis argument of repeat(), to make it much more simple:
tst_second = np.repeat(tst_sound, frequency, 1)

We can pack all this into a function which takes the sample frequency and the number of channels directly from the mixer module:
def create_snd(frequency=440., duration=.3):
    # mixer init props:
    fs = mixer.get_init()[0]
    new_type = "int"+str(abs(mixer.get_init()[1]))
    chs = mixer.get_init()[2]
    # one wave
    tst_sound = np.tile(np.linspace(-1.,1.,fs/frequency), (chs, 1))
    # make in one second
    tst_second = np.repeat(tst_sound, frequency*duration, 1)
    # convert it to a sound and return it
    return(snd.make_sound(tst_second.astype(new_type)))

It cannot be played without a proper initializaton os pygame (I don't know why, probably reading the docs would help :D), so here are two short functions for playing a sine wave and manipulate its frequency:

def create_sin(frequency=440., duration=.3, amplitude=100):
    # mixer init props
    fs = mixer.get_init()[0]
    chs = mixer.get_init()[2]
    snd_type = "int"+str(abs(mixer.get_init()[1]))
    # time
    t = np.arange(0, duration, 1./fs)
    freq = np.array((frequency, frequency))
    sig = np.sin(2 * np.pi * freq * t.reshape(-1, 1))*amplitude
    return(snd.make_sound(sig.astype(snd_type)))
 
# program for changing notes
def int_snd_test():
    # program for changing notes
    #
    pg.init()
    pg.display.set_mode([50,50])
    mixer.pre_init(frequency=44100)
    mixer.init(frequency=44100, channels=1)
    #
    freq=440.
    done = False
    print("press w to increase, s to decrease frequency")
    #
    while not done:
        e = pg.event.wait()
        print("Waiting for user event")
        # create sound and play it:
        create_sin(frequency=freq, duration=1, amplitude=250).play()
        print("sound began to play...")
        # check user's events:
        if e.type == pg.KEYDOWN:
            if e.key == pg.K_ESCAPE:
                done = True
                pg.display.quit()
                break
            elif e.key == pg.K_w:
                freq += 100
                print("frequency up; ", freq)
            elif e.key == pg.K_s:
                freq -= 100
                print("frequency down;", freq)

However, some of the waves consist of two different sounds. Anyone has any idea why? I did not found in ot stack overflow.

Note that I added a duration argument to the function to be able make it shorter than one sec and amplitude variable becaouse the original was too silent.

Mouse under control

In the last few weeks I was tinkering on my new laptop to make it more comfortable. It's not easy at all. I mostly passed solved the problem of the sleeping (the lenovo thinkpad does not have display after waking up, and it is idle until I switch it off. The solution is to disable USB3.0, after that the machine will wake up, althoug the lights will signal something else). Anyway, the next task was to create a right mouse button on the upper side of the touchpad.

The problem basically is that on this ThinkPad there is no physical mouse buttons next to the rubber pointer, instead they are part of the touchpad. Thus when I disable that, I don't have mouse buttons. But the default settings was that both sides of the upper part of the taouchapd works as the left mouse button. After some digging, I found a nice question with good comments and answer on askubuntu, which describes exactly my problem. The only difference was that the user here wats more: to disable mouse movements induced by the touchpad (if I get it right, it was not working, so it is still an open question). Summary? copy the code on the site to a file names anyithing which begins with "99-" and ends with ".conf", e.g. "99-makerightmousebutton.conf", and place it in the /usr/share/X11/xorg.conf.d/ directory.

Code is under the section, and make sure to check the linked site above as this site will not show indentation.

Section "InputClass"
Identifier "t440 top buttons"
MatchDriver "synaptics"
Option "SoftButtonAreas" "60% 0 0 0 40% 60% 0 0" #Emulate right and midle buttons
Option "Synaptics Area" "0 0 0 1" #disable moving but not buttons
EndSection

And if someone would like to get one step closer to disable touchpad movements, copy this exact file to /etc/X11/xorg.conf.d/.

Kinect for Windows in da house! I mean on Lubuntu

Today I got my hands on a Kinect for Windows. But I only have XP and Lubuntu 12.04 on my machine, so I had a choice... and decided to try to make it work on the Linux. It was worth the time.

I checked 3 tutorials, two about freenect, one about OpenNi, but I could compile neither. This tutorial required some trial and error too, but eventually it provided a result - I could see some image.

Some comments on the instructable: I use Lubuntu 12.04, and it worked, however, you should always choose the freshest versions from both the Vrui and Kinect (now 3.1-002 and 2.8 respectively). Also, Vrui has a .sh script which failed to finish its job, so I recommend to download, then $make and $make install than run that script.

Fresh or whatever

At the weekend, as a gift for my graduation I updated the Drupal core of the site, as it was begging for it about six months. From 7.24 to 7.28. Of course, I was afraid for a good reason: I got Internal Server Error for the first trial again. the solution was not to update .htacces - if I remember correctly. I don't know how will I manage if I need to updat that one too.

After the site was on its feet again, I searched for a solution to the problem with color schemes: the site did not change its colos after I modified them in the color module, only made it fall apart. I found a solution to this problem also, There is a .htacces file somewhere in the "files" folder, where I commented out the options (the second and third rows). After this, I managed to change the colors to this beautiful warm yellow and brown combination (what an artistic choice, ah!). After this, I got my attention attached to the fact that my name (a.k.a. site name) disappeared from the page, which I thing is related to those settings - althoug it's not a great loss, I have bigger problems now.

My bigger concern is that the modules page still want me to update the core, as it says, I have 7.24. Well, I have not. Checked the status report: current version: 7.28, there is a recommended security update to 7.28. Of course, I found solution to this problem again (viva la help forums), another users fought with this phenomenon and found that if I would first update to 7.25, than 7.26, etc. Hmm... I will not do that now, because the system works perfectly.

Find all duplicated values in R

In the past few weeks, I had time, with only a few tasks to do, so I tried to optimize some of my functions.
For example, I needed a function which tells me which rows of a data.frame are duplicated. duplicated() is a really fast tool for this, but I needed something more: I had to know not only the repeating instances of the data, but the first one, too. It is useful if you have a database loaded as a data.frame, and you want to know where the items identified by a key differ.
The basic idea that you searc the repeating instances of the keys, then compare them to the whole original df. It can be pretty slow after getting the indices which are duplicated, then select the given rows and coloumns of the df, but the worst part is the comparing. You cannot just use %in%, because it does not works. After I realized this, I looked inside the duplicated(), what is in it?

> duplicated
function (x, incomparables = FALSE, ...)
UseMethod("duplicated")

Well, it did not help too much, but I learned that this is a method, which means if I call the function, itt will check its argument's class, and will try to call the appropriate function for it, in this case, the duplicated.data.frame(). So, I looked in that one:

> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...)
{
if (!identical(incomparables, FALSE))
.NotYetUsed("incomparables != FALSE")
if (length(x) != 1L)
duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
else duplicated(x[[1L]], fromLast = fromLast, ...)
}

That's the one. In the first if, it calls a .NotYetUsed(), which is unknown for me, but one can see it only shows a message (I don't know yet why and how this works here):

> .NotYetUsed
function (arg, error = TRUE)
{
msg <- gettextf("argument '%s' is not used (yet)", arg)
if (error)
stop(msg, domain = NA, call. = FALSE)
else warning(msg, domain = NA, call. = FALSE)
}

After the first if clause did not gave me much hint, the second become my true guide. It says that if the number of coloumns (because length.data.frame gives us that number) of the data.frame are not one (1=1L, for more information check this), paste the coloumns together, and seach duplicated on those, as the result will be a character vector, so duplicated.character will work for that. If the data.frame consists only of one coloumn, then extract it with "[[", it will give us a vector too, and the appropriate duplicated() can be used.

Note the trick at the pasting step: it uses a do.call(). I couldn't decode it fully, but somewhere in the StackOwerFlow - R help - R-bloggers trinagle I read that it creates one funcion call from its arguments, while the *apply family creates a function call for each item of its agruments. Somewhere here is the wichery. Although after collecting all this information, I decided to use the example of duplicated.data.frame(), and search the duplicated values in the pasted versions, then search the duplicated ones in the pasted too with the %in% operator (a.k.a. the match() function). Here is the code, which can be found on GitHub:

allDuplicated <- function(x){
# This function returns TRUE for the first occurrence, too.
x.pasted <- do.call("paste0", c(x))
d <- x.pasted[duplicated(x.pasted)]
return(x.pasted %in% d)
}

As usual, the code available on GitHub.

I also did a little benchmarking: I used a data.frame with 2 coloumns and 25492 rows, which had only 81 unique values. To repeat the duplicated() function 100 times with a for() loop, took 8.03 sec, while my allDuplicated() needed 8.83 sec, it was about a tenth slower. I also measured a 1000 times repetition: the overall time the duplicated() needed was 83.19, while the allDuplicaded ran in only 81.92 sec. I don't know, why, maybe because I looked at a notepad window while the duplicated() worked...

I tried it without disturbance, on a generated non-repeating same-dimensional dataset. This time the allDuplicated() needed less time to run 1000 times, only 103.63 sec compared to the 104.36 sec of the duplicated() function.Explanations are welcomed!

Scrape blogposts' data directly - Part III.

Okay, I just found R-Fiddle and I want to try it NOW, so I will show you how to finish the automated data collection from criticalmass.hu.

We know how to get data from one node. We need to know how to find all the nodes - and that is not so hard, as criticalmass.hu is almost purely Drupal, with only a little change in it, thus Drupal documentation can help us, or only just logic: we can access any post with the http://site-name/node/node-number. Yup, that easy. So all we have to do is automatically create the web addresses, and then run the previous function on it, then save the results.

I myself wrapped the to step in one data.frame: for data storage, I created a data.frame with three variables: id, author and time. The is the node-number, also the index of the rows in the data.frame. For this, we must know how much post we want to search, but today's internet connections allow us to overestimate.
The code is like: data.container <- data.frame(id=1:nodes, author=NA, time=NA), where nodes is a given positive integer. After this, we just only need to iterate over each row and wait. There will be some strange things, when there are pages not available or not found, but those can be filtered later.

This function can easily fail if the getURL() function does not get proper answer, in which case it throws an error which will cause the whole function to stop. This is why it is safer to write the data.frame into a csv file, and read them back at a restart. The full code can be seen below:

Scrape blogposts' data directly - Part II.

In the previous post, we almost decoded the time from a criticalmass.hu blogpost. We have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")

Now, we will extract the exact time as a POSIX object, then get the author's name as a string. Let's see!

Our time object is a list of one which contains a character vector: each element of this vector is a long string. We need the first string's last 16 character, which can be reached by splitting the whole string into individual characters, then take the last sixteen and bind them into one. To split the string into characters, we can use the strsplit() function again: time <- unlist(strsplit(unlist(splitted.source)[1], split="")). Note the nested unlist() function: it binds every elements of a list into one vector, if it is possible. Instead of this function, we could simply use time[[1]][1], which is basically the same. I used unlist() after strsplit()-ting the string, because in the next step I will need the number of the characters, and this way I can manage it in one step.
Now we have a bunch of separate characters in a one-element list, of which element we need the last sixteen character: time <- time[(length(time)-15) : length(time)]. Brackets are very important in this case, because the colon (:) operator is stronger than the multiplication, also than the sum and extraction. If we would leave those brackets from (length(time)-16), we would get a vector with decreasing values, from length(time)-15 to 0. We have to create one string from the 16 characters, and we are almost done with this. For making the string, we use the paste() finction, and especially its collapse argument: time <- paste(time, collapse="").
R can handle time codes, and can transform strings to time with the as.POSIXlt() or with the as.POSIXct() function. For more information, see ?as.POSIXlt.

How to get the user?

It is almost as easy as to get the time. From the previous post, we know that we need the string immediately in front of the string "</a></span>". And if we split the splitted.source's second element, the first part of it will store the username. So, the command looks like author <- strsplit(splitted.source[[1]][2], split="</a></span>"). We now have "=\"/tagok/erhardt-gergo-szeged\" title=\"Felhasználói profil megtekintése.\">erhardt.gergo_szeged" in the author[[1]][1]. We now only have to use strsplit() again: author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2].
This way we have the following code:
library(RCurl)
page.source <- getURL("http://criticalmass.hu/node/500")
splitted.source <- strsplit(page.source, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- time[(length(time)-16) : length(time)]
time <- paste(time, collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]

Let's create a function!

You will need only one input variable (the web address of the blogpost), and you have to return the values. As the time variable is character, you can do than in a vector:
getTimeAuthor <- function(blog.address){
page.source <- getURL(blog.address)
splitted.source <- strsplit(page.source, split=" — <a href")
time <- unlist(strsplit(unlist(splitted.source)[1], split=""))
time <- time[(length(time)-16) : length(time)]
time <- paste(time, collapse="")
author <- strsplit(splitted.source[[1]][2], split="</a></span>")
author <- unlist(strsplit(author[[1]][1], split="megtekintése.\">"))[2]
return(c(author,time))
}

Scrape blogposts' data directly - Part I.

Prevoiusly I showed a way to get geocoordinets from Google with R. Although it could be better, tat will do for a one-time trial. This time, however, I show something a little bit more exciting - at least more exciting for me.

I needed some data from the site of the Hungarian Critical Mass movement. This page and the movement is (os was) a so called grassroots movement, the users are the authors of the site. Here emerges the shadow of the last months' NSA scandal: this site is open to public, it publishes information about the users who registered to use it, and were brave enough to write a post to the audience. How could we track, when, and how much a user were this brave?

First of all, you should be familiar with what you are looking for. Let's look at this post. We want to get its time of publising, and the author. If you can see it, you are good to go: the info is under the title "Szegednek, segítsetek!": "2006-04-16 22:47 — erhardt.gergo_szeged". Now we now what we are looking for, how could we get it? One can access the source code of the page which s/he actually watches, usually in a web browser the right click of the mouse will give you some option about source. Then you just have to search the desired text:
"<span class="submitted">v, 2006-04-16 22:47 — <a href="/tagok/erhardt-gergo-szeged" title="Felhasználói profil megtekintése.">erhardt.gergo_szeged</a></span> " We would like to get the part after "submitted" and the part before the "</a></span>" at the end.

Second: You need a device. In my case, this is R, which is quite good, but not perfect - but this imperfection can be healed by installing the RCurl package. It it is loaded, you can use the getURL() function. It returns the answer of the server to the given address. I say answer, because it is not always an HTML site's source code, it can be anything, and in the previous post it was a JSON object. This time it will be the source code we want, just try it with running the command getURL("http://criticalmass.hu/node/500"). You can see a bunch of characters. Maybe you should assign to them a variable name, for example: page.source <- getURL("http://criticalmass.hu/node/500").

How can you slice this pile of characters? R has the strsplit() function, which cuts every string in a vector by the given split parameter. First, I will show how to extract the date of the post:
Luckily, the date is always in the same format, and consumes 16 characters. I can chop down the characters immediately after the date and extract the last 16 characters. The string to chop is " — <a href" with the command time <- strsplit(page.source, split=" — <a href"). This will give us a list with one element, which is a two length character vector. It is a list, because strsplit() expects a vector of strings, and each splitted element will have an own list element. We have a vector which contains only one element, so strsplit() will return a list with one element. In that element, we will have the splitted string as a two-element vector. We will need the last 16 characters of the first element.

The code looks like this now:
page.source <- getURL("http://criticalmass.hu/node/500")
time <- strsplit(page.source, split=" — <a href")

In the next part I will show how to finish this function.

Pages