Scraping Google Instant Search

by imrat on September 27, 2010

About 10 days ago I showed you how to build a simple Yahoo Pipe to scrape the Google Hot Trends. Today, I am going to take that tutorial to the next level.

You might want to use the hot keywords in your CPV campaign. Rather than manually copying each of the top 10 results, you should really automate this task.

Unfortunately, Yahoo Pipes does not work with Google.

The reason is that Yahoo Pipes respects the robots.txt file, which asks ‘spiders’ not to scrape.
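To make the robots.txt point concrete, here is a minimal Python sketch of how a well-behaved tool decides whether it may fetch a page. The robots.txt content, the bot name and the example.com URLs are all made up for illustration; this is not Google's actual file, and it is not how Yahoo Pipes is implemented internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up crawler name.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot",
                     "http://www.example.com/search?q=test"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot",
                     "http://www.example.com/about"))
```

A tool that honours these rules simply refuses to fetch any path that is disallowed, which is why a search results page cannot be loaded through it.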

No worries though. There is another excellent tool that you can use:

Open Dapper Tool

So today I will show you how you can easily scrape URLs from Google with Dapper. It's a simple and quick method that still works, even after Google switched on Instant Search.

For an introduction to Dapper, watch this video. Dapper is basically a simplified version of the Fetch Page module in Yahoo Pipes.

Note: throughout the steps below, a window may pop up offering you a demo of the next step in the process. Feel free to watch these to get to know how Dapper works.

1. Switch off Google Instant

To scrape URLs with Dapper you will need a Google search URL. You don't want to use the one from Google Instant, so switch it off: start typing something in the search box, then select the option to switch it off:

Select The Off Option To Switch Off Google Instant

Next, click the Advanced Search button, and select the option to return 100 results. Run any search and grab the Google URL once the results are showing. If you want you can clean up the URL, but it's not essential.

Note down the URL in a text document as you will need it later.
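If you do want to clean up the URL, the idea can be sketched in a few lines of Python: keep only the parameters you care about and drop the rest. The example URL and its extra parameters are made up for illustration, and keeping just q and num is my assumption based on the two values edited later in this tutorial.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only the query (q) and result count (num) matter here.
KEEP_PARAMS = ("q", "num")


def clean_search_url(url: str, keep=KEEP_PARAMS) -> str:
    """Drop every query parameter except those in keep, and any #fragment."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key in keep]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    # Illustrative URL, not copied from a real search session.
    messy = "http://www.google.com/search?hl=en&q=hot+trends&num=100&btnG=Search#top"
    print(clean_search_url(messy))
```

The shorter URL is easier to read when you edit it in step 6, but the long version should work just as well.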

2. Create the Dapp

Next, log in to Dapper, and create a new Dapp. Then just paste the Google URL from the previous step into the ‘Enter the URL’ field and press ‘Next step’.

This will then show you a window with the Google search screen, where in the next step you will save a couple of pages.

3. Saving pages

To ensure that Dapper scrapes the right information, you need to teach it what text to grab. You will want to grab a couple of pages to make sure it does this correctly.

You need to add the page to Dapper by clicking the ‘Add to Basket’ button. Then enter a different query in the search box, and press search. Again, add this page to the basket. Do this once more.

Then click ‘Next Step’ in the left side bar.

4. Selecting content

You are now at the 3rd stage of the Dapper scraper, where you select the content you want to scrape.

All you need to do is click the title of the first search result to select the content you want to scrape:

This should highlight the titles of all 10 results in orange, and the tabs for each page you put in the basket should show 10 items like in the following diagram:

You now need to name the variable that you want to assign this scraped data to. At the bottom of your screen you see the scraped data and the ‘Save Field’ button. Click it, name the field, and click the next step button in the left sidebar.
Don’t worry if the selected content box on the left is empty once you have named the field. This simply means you have saved the data to the field.
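If you are curious what ‘select the title of every result’ amounts to, here is a rough Python sketch of the same idea: find every element that matches a pattern and collect its text. The HTML snippet and the rule (a link inside an h3) are assumptions of mine for illustration. Dapper works the pattern out for you from your clicks, and Google's real markup is different and changes over time.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of every link that sits inside an <h3> element."""

    def __init__(self):
        super().__init__()
        self._in_h3 = False
        self._in_link = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Link closed: store the text gathered since it opened.
            self.titles.append("".join(self._buffer).strip())
            self._in_link = False
        elif tag == "h3":
            self._in_h3 = False

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


# Hypothetical results markup, not Google's actual HTML.
SAMPLE_HTML = """
<ol>
  <li><h3><a href="http://example.com/one">First <em>result</em></a></h3><p>Snippet</p></li>
  <li><h3><a href="http://example.com/two">Second result</a></h3><p>Snippet</p></li>
</ol>
"""


def extract_titles(html: str) -> list:
    collector = TitleCollector()
    collector.feed(html)
    collector.close()
    return collector.titles


if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

Saving several pages to the basket helps in the same way that testing this code on more than one page would: it confirms the pattern picks up the titles everywhere and not just on one page.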

5. Preview feed (nothing to do here)

In this case we don't have multiple variables, so in step 4, Preview Feed, you don't need to do anything. Just click the next step button again.

6. Saving & configuring your Dapp

You are now ready to save your Dapp. Give it a recognisable name, e.g. Google Search.

Then, under the input variables heading, click edit next to ‘Query’. This will open an input box with the Google URL. You will need to edit this URL to identify the query variable. These are the bits of the URL you want to edit:

q=searchkeyword
num=100

In the URL, replace the values searchkeyword and 100 with {url} and {num}.

Click save next to the input box, and this should show the URL with {url} and {num} highlighted in green, which means they are treated as input variables.
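The {url} and {num} placeholders work like a template: each time the Dapp runs, the values you supply are dropped into the URL. A small Python sketch of that substitution is below. The template string is an assumption of mine about what the edited URL might look like; your own URL will probably carry extra parameters. The fill_template helper is hypothetical and not part of Dapper.

```python
from urllib.parse import quote_plus

# Assumed shape of the edited search URL; yours may contain more parameters.
TEMPLATE = "http://www.google.com/search?q={url}&num={num}"


def fill_template(template: str, keyword: str, num: int = 100) -> str:
    """Substitute the two input variables, URL-encoding the keyword."""
    return template.format(url=quote_plus(keyword), num=int(num))


if __name__ == "__main__":
    print(fill_template(TEMPLATE, "hot trends", 10))
```

Note that the keyword has to be URL-encoded (spaces become +), which is what quote_plus takes care of in this sketch.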

Then click the Save button in the bottom right of the screen, and you will be taken to the Dapp results screen.

7. Getting ready to use your Dapp

To make sure your Dapp is working correctly, just enter a test URL under the ‘Set Input’ heading (see 1 in the screenshot) and press Update Input. This should show a list of links from related sites.

Next, you will need to choose your format. It is set to XML by default, so all you need to do is click go to open the screen where you configure the XML feed.

What this does is create a URL that will load the search results and return them as a simple XML feed. This way you can easily import the results in Yahoo Pipes. Make sure you add some easily recognisable text to the url field, and click ‘Update Input’. This will update the XML URL and add the required variables.

Now all you need to do is click on the XML URL to select it, and copy & paste. I recommend you copy and paste it into a text document so you have it for later reference.
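Once you have the XML URL you are not limited to Yahoo Pipes; any script that can read XML can use the feed. Below is a minimal Python sketch that pulls one field out of a feed. The sample XML and the field name ‘title’ are assumptions for illustration: the real element names depend on what you called your field in step 4, and Dapper's actual feed layout may differ from this.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout; check your own feed for the real element names.
SAMPLE_FEED = """<results>
  <title>First result</title>
  <title>Second result</title>
</results>
"""


def read_field(xml_text: str, field: str) -> list:
    """Return the text of every element named field, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text.strip() for element in root.iter(field) if element.text]


if __name__ == "__main__":
    for title in read_field(SAMPLE_FEED, "title"):
        print(title)
```

In a real script you would download the feed from the XML URL you saved and pass its text to read_field in place of the sample.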

What's next

In most cases, once I have the XML URL from Dapper, I use Yahoo Pipes to extract the data and modify & merge it with other sources. I am not going to cover that here, but keep your eyes out for a future Guest Post which covers this and many other tricks.
If you liked this post, I would really like it if you could leave a comment below.
Thanks
tijn
  • http://babystepsforward.com Joseph

    Hey, buddy! It’s Joseph from Traffic Dojo. Been subscribed to your RSS for a while now. This was a *really* great post. I’m extremely inspired!!

    I’ve always admired your pedagogical, technical and creative skillz. I was really feeling good about my progress lately. But… It seems I’m *still* 10 steps behind you!

    Damn you, nemesis!

    =)

  • http://imrat.com imrat

    Thanks Joseph!! It's a shame I don't have more time to post more, but at least that way I keep the quality high. No doubt you will catch up soon though.

  • Muhnsoonerbate

    I am learning your Google scraping technique but have had problems with Dapper once I copy the XML URL into the URL Builder and run it through the Fetch Data module. Nothing shows up in the debugger. I've tried it several times now with the same end result. I even followed your example Dapp and still have the same problem. I was wondering if you could give me some insight as to what could be going wrong.

    thanks the posts are great!!
    ken

  • http://twitter.com/pavlicko pavlicko

    nice post – shame google blocked it. Pretty flexible tool.

  • http://www.seodenver.com Zack Katz

    I’ve found a different way to scrape the Google Instant data – by calling for the javascript file instead. Check it out: http://pipes.yahoo.com/katzwebservices/googleinstantscraper

  • http://imrat.com imrat

    Very useful way of grabbing the google suggest keywords ;) Thanks for posting.

  • http://imrat.com imrat

    It's strange really. Sometimes it works, sometimes it doesn't. I am thinking of just building a simple scraper myself and putting it on this blog for people to use. What do you think?

  • http://imrat.com imrat

    As @pavlicko suggests, Dapper has made some changes which make it no longer consistent when scraping Google results. I'll see if I can find a different solution.

  • Nathaniel Taylor

    I would like to know if there is any other way of doing this?

    It seems like Google is not allowing Dapper to scrape anymore.
