R-Fiddle

Scrape blog posts' data directly - Part III.

Okay, I just found R-Fiddle and I want to try it NOW, so I will show you how to finish the automated data collection from criticalmass.hu.

We know how to get data from one node. Now we need to find all the nodes - and that is not so hard, since criticalmass.hu is almost pure Drupal with only minor customizations, so the Drupal documentation can help us, or simply logic: any post can be accessed at http://site-name/node/node-number. Yup, that easy. So all we have to do is generate these web addresses automatically, run the previous function on each of them, and save the results.
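For example, generating the addresses is a single paste0() call; a minimal sketch, where the host name is criticalmass.hu as in the post and the node count is only a guess:

# build the address of every node we want to visit
nodes     <- 20000                                        # deliberate overestimate of the number of posts
node.urls <- paste0("http://criticalmass.hu/node/", 1:nodes)

head(node.urls)   # "http://criticalmass.hu/node/1" "http://criticalmass.hu/node/2" ...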

I built these two steps around a single data.frame: for data storage, I created a data.frame with three variables: id, author and time. The id is the node-number, which is also the row index in the data.frame. For this we must know in advance how many posts we want to search, but today's internet connections allow us to overestimate.
The code looks like this: data.container <- data.frame(id=1:nodes, author=NA, time=NA), where nodes is a given positive integer. After that, we only need to iterate over the rows and wait. There will be some strange results when a page is unavailable or not found, but those can be filtered out later.
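A minimal sketch of this loop, assuming the one-node scraper from the previous part is a function called get.node.data() that returns a list with author and time elements (the name and return value are my assumptions, not the original code):

nodes          <- 20000
data.container <- data.frame(id = 1:nodes, author = NA, time = NA)

for (i in 1:nodes) {
  # get.node.data() stands for the scraper built in the previous part
  result <- get.node.data(paste0("http://criticalmass.hu/node/", i))
  data.container$author[i] <- result$author
  data.container$time[i]   <- result$time
}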

This function can easily fail if getURL() does not get a proper answer, in which case it throws an error that stops the whole run. This is why it is safer to write the data.frame to a csv file regularly and read it back after a restart. The full code can be seen below:
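A rough sketch of such a restart-safe loop (not the original full code) could look like this, with tryCatch() catching the failed requests and a csv checkpoint every hundred nodes; get.node.data() and results.csv are again placeholder names:

library(RCurl)   # getURL() is used inside the scraper

nodes        <- 20000
results.file <- "results.csv"

# resume from the csv if it exists, otherwise start with an empty container
if (file.exists(results.file)) {
  data.container <- read.csv(results.file, stringsAsFactors = FALSE)
} else {
  data.container <- data.frame(id = 1:nodes, author = NA, time = NA)
}

for (i in which(is.na(data.container$author))) {
  result <- tryCatch(
    get.node.data(paste0("http://criticalmass.hu/node/", i)),
    error = function(e) NULL          # keep going when getURL() fails
  )
  if (!is.null(result)) {
    data.container$author[i] <- result$author
    data.container$time[i]   <- result$time
  }
  # write a checkpoint every 100 nodes so a crash loses little work
  if (i %% 100 == 0) write.csv(data.container, results.file, row.names = FALSE)
}
write.csv(data.container, results.file, row.names = FALSE)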
