
Learning by examples (1) : Data download

Two (or three) things happened to land on my TODO list (I could have decided to focus only on the most important task, but okay, let’s try to manage them all at once):

  1. I have to set up some sample data to help people work on the cluster, which is (hopefully) close to being declared “in production”
  2. Colleagues and I recently discussed the idea of testing the simulated data available for the coming 15th QTLMAS Workshop, held in Rennes
  3. I’d like to present sample shell script code that helps automate the retrieval of data

So let’s start with the first step of the task, data download. I just had to run this little script to obtain all the simulated data available on the site.

# Declare the page containing the links to the gzip files
Add=https://colloque.inra.fr/qtlmas/Home-page/News/The-simulated-data-set
# First, retrieve the web page and save it in the file "webpage"
wget -O webpage "$Add"
# Second, find all the addresses of the gzipped files and download them
for File in `gawk -F \" '/\.gz"/{print $2}' webpage`
do
 wget "$File"
done
# Uncompress the files (gunzip removes each .gz once it is extracted)
gunzip *.gz
# Clean the directory before leaving
rm webpage

Explanation :

First, I went (manually :-s) to the web site to find the page where the links to the data are available. The address is then stored in the variable “Add”.

Second, I use wget (available on most Linux distros), which downloads the page or document found at the given URL (here, the value of “Add”). The -O option sets the name of the file where the download is stored (here, webpage).

Then, among all the text contained in the file “webpage”, we only want the lines containing the address of a gzip file (that is, containing a “.gz” extension). Since we are looking for a link written in HTML, it is highly probable that this address will sit inside an anchor tag such as:

<a href="something.gz">

We therefore tell gawk, with the -F flag, that the field separator is the double quote (").
So basically gawk first selects all the lines where .gz is immediately followed by a double quote (this is the pattern written between the two slashes in our instruction), then, for each of these lines, prints only the second field, which is the address itself.
I know, we are lucky to have a well-written web page with one line per file to download. It could have been much trickier, but let’s say that the simple case can also happen!
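To see the field splitting at work without touching the network, here is a small sketch run on a made-up page. The example.org addresses and file names are hypothetical, and plain awk is used so that it runs where gawk is not installed (gawk should accept the same command).

```shell
# Hypothetical sample page: one link per line (made-up addresses).
sample_page='<p>Some text without any link</p>
<a href="http://example.org/data/genotypes.gz">genotypes</a>
<a href="http://example.org/data/phenotypes.gz">phenotypes</a>
<a href="http://example.org/about.html">about</a>'

# Keep only the lines where .gz is immediately followed by a double quote,
# split them on double quotes, and print the second field (the address).
links=$(printf '%s\n' "$sample_page" | awk -F '"' '/\.gz"/ {print $2}')

printf '%s\n' "$links"
```

Only the two addresses ending in .gz are printed; the paragraph without a link and the about.html link are skipped. Note that this approach only returns the first quoted value of each matching line, so it relies on the page having a single link per line.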

The whole awk instruction is enclosed between two backticks (obtained with alt+gr 7 on my keyboard). This asks the shell to execute the expression and to put its output in place of the expression. The for loop then splits that output into words and assigns them, one at a time, to the variable File.
File therefore takes as its value, in turn, each address where a file to download is located, and wget downloads it.
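The loop mechanics can be tried on their own with a stand-in for the gawk command. In this sketch, list_links is a hypothetical function that prints two made-up addresses, and the download is replaced by an echo so that nothing is fetched.

```shell
# Hypothetical stand-in for the gawk command: prints one address per line.
list_links() {
    printf '%s\n' "http://example.org/a.gz" "http://example.org/b.gz"
}

count=0
# The backticks run list_links; the loop walks through each word of its output.
for File in `list_links`
do
    # A real run would call: wget "$File"
    echo "would download: $File"
    count=$((count + 1))
done
echo "$count files"
```

Because the shell splits the output on whitespace, this pattern suits addresses that contain no spaces, which is assumed to be the case here.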

Finally, we only have to gunzip the files and leave the place as clean as possible.
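As a side note on the clean-up, gunzip by default replaces each archive with its uncompressed content, so no .gz file is left once it has run. A minimal sketch in a temporary directory, with a hypothetical file name, shows this:

```shell
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)
echo "hypothetical content" > "$workdir/sample.txt"

gzip "$workdir/sample.txt"     # creates sample.txt.gz, removes sample.txt
gunzip "$workdir"/*.gz         # restores sample.txt, removes sample.txt.gz

remaining=$(ls "$workdir")
echo "$remaining"

# Remove the throwaway directory.
rm -r "$workdir"
```

Only sample.txt is listed, which means the only extra file to remove after the download script is the saved web page.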

Isn’t it marvelous? Next step: make some changes to the data so that they fit our expected data format.
