Yahoo Pipes is a website that allows you to extract, amend, and combine data from a large range of sites and services by dragging and dropping basic modules.
The best way to get a basic understanding is to watch this Yahoo Pipes introductory video.
Here is an overview of the Yahoo Pipes editor screen:
At this point you will need some basic knowledge of HTML.
To extract data from a website, all you need to do is use the Fetch Page module and plug in the URL of the site (using the URL Builder module if needed). As an example, I will show you how to scrape the keywords from Google Hot Trends.
You want to identify the HTML code immediately before and after the list of data you are looking for. I just use the browser's View Source function.
Before: <td width="25%">
You want to add these two items to the module.
Then select the module and press Refresh in the Debugger window. If you have picked the right HTML code, you should see the data you want to scrape by expanding the 0 just underneath "time taken", as in the screenshot below:
Then identify the HTML tag that separates each unit of data by clicking on Source and looking at the source code. Often these are list items, i.e. <li>…</li>, or anchors, <a>…</a>. After some trial and error I found that using </a> as the delimiter to split the items gave the best result:
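If you want to see what the Fetch Page module is doing under the hood, here is a rough Python sketch of the same cut-and-split logic. To keep it self-contained it works on a short inline sample instead of fetching the live page, and the "after" marker and sample markup are made-up placeholders; substitute the real values you found by viewing the page source.

```python
# Stand-in for the fetched page (the real module downloads the live URL).
sample_html = (
    '<table><td width="25%">'
    '<a href="http://example.com/a">first trend</a>'
    '<a href="http://example.com/b">second trend</a>'
    '</td></table>'
)

before = '<td width="25%">'   # the "Cut content from" marker
after = '</td>'               # placeholder "to" marker -- use the one you identified
delimiter = '</a>'            # splits the cut section into one item per link

# Cut out the section between the two markers, then split it on the delimiter.
start = sample_html.find(before) + len(before)
end = sample_html.find(after, start)
items = [i for i in sample_html[start:end].split(delimiter) if i]
print(items)
```

Each entry in `items` corresponds to one row in the Debugger output, still wrapped in the leading `<a …>` noise, which is exactly the problem the next step solves.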
Although in most instances you're done at this point, in this particular case I am after the exact keyword and none of the surrounding HTML. If you click on the 'source' link above the item, you see what the problem is: lots of other noise surrounding the piece I want:
<a rel="nofollow" target="_blank" href="http://www.google.com/trends/hottrends?q=grandparents+day+2010&date=2010-9-12&sa=X">grandparents day 2010</a>
You therefore need a second module to clean the data. For that use Operators > Regex. This will cycle through each of the items from the previous module and apply a regex to the item content.
Don't forget to connect the output of the Fetch Page module to the Regex module.
Regex is quite advanced stuff and I struggle with it at times. Check this site for a good online tool with explanations.
To extract the keyword here, the easiest approach for a novice like me is to simply delete the parts of the content that you don't want. In this case all items are consistent, so it's fairly easy.
So with the first regex rule, select the content up to …sa=X"> and replace it with nothing. To delete the end, select all text from </a… onwards and replace that with nothing too. Make sure the options are set as indicated in the screenshot below:
As a final step, to get the list of keywords to show up when you run the pipe, you need to copy item.content to item.title.
I also want the titles to link through to the Google search for the keyword. To do this you need to append the keyword to the Google search URL for each item. I used Advanced Search, because Instant Search was creating some URL problems in Yahoo Pipes.
To do this you create a Loop module and insert a URL Builder module into it. Paste the full Google Advanced Search URL (the one after you press Enter) into the Base field and press Tab. This should split the URL into all its parameter fields. Quickly remove all the ones that have no value by clicking the – button.
Then change the value of the as_q field by clicking the blue arrow next to the value field and selecting item.title. To ensure the link variable is created, type item.link in the assign-results-to field.
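In effect, the Loop plus URL Builder combination does something like the following sketch for every item: it takes the search base URL, sets the as_q parameter to the item's title, and stores the result in item.link. The base URL here is simplified to just the as_q parameter; the real Advanced Search URL carries several more fields.

```python
from urllib.parse import urlencode

def build_link(item):
    # Hypothetical simplification of what URL Builder assembles per item:
    # base URL + as_q set to the keyword, stored back as item.link.
    base = "http://www.google.com/search"
    item["link"] = base + "?" + urlencode({"as_q": item["title"]})
    return item

item = build_link({"title": "grandparents day 2010"})
print(item["link"])  # http://www.google.com/search?as_q=grandparents+day+2010
```

`urlencode` takes care of escaping spaces and special characters, which is the same URL problem that made Instant Search awkward to use here.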
Finally, you're almost done. All you now need to do is connect the output of the Regex module to the Pipe Output.
To save, just click the Save button in the right-hand corner and add a pipe name.
Then…just press Run Pipe…
It will look something like this:
You can test it yourself here.
From here you can expand the pipe in a number of ways, for example by adding in some Alexa data, Google Trends data, or Google related search terms. Alternatively, you can use this pipe as an input to another pipe that, for example, checks for domain availability.
That's it for now.
I am working on a related guest post at the moment that will walk you through the process of creating your own URL scraper for scaling your PPV campaigns.
If you liked this post, or have some questions, leave a comment below…