Wednesday 16 November 2011

16/11/2011 Scrapy

Today I've been experimenting with Scrapy. It's outwith my projected plan for this week, but I expect to be able to finish my other tasks as well.

With Scrapy I'm experimenting with XPath selectors in the Scrapy shell. I'm testing this against the BBC Food website to work out how to extract data correctly, in order to semi-automate the recipe crawling.

Achieved so far


Get the title of the recipe - hxs.select('//h1/text()')
Get the ingredients h2 - hxs.select('//div[@id="ingredients"]/h2/text()')
Get the volume of ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/text()') (problem with fractions)
Get ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/a/text()')
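As a sanity check of the selectors above, here's a stdlib-only sketch. It uses ElementTree rather than Scrapy's HtmlXPathSelector, on an invented snippet that mimics the BBC Food structure (the real pages are messier); ElementTree only supports a subset of XPath, so instead of text() we select the element and read .text:

```python
# Stand-in for the scrapy shell session: same element paths as the
# selectors above, against a simplified, invented page structure.
import xml.etree.ElementTree as ET

SAMPLE = """
<html><body>
  <h1>Pancakes</h1>
  <div id="ingredients"><h2>Ingredients</h2></div>
  <dl id="stages">
    <dd><ul>
      <li><p>110g <a>plain flour</a></p></li>
      <li><p>3 <a>free-range eggs</a></p></li>
    </ul></dd>
  </dl>
</body></html>
"""

root = ET.fromstring(SAMPLE)

title = root.find(".//h1").text                        # recipe title
heading = root.find(".//div[@id='ingredients']/h2").text
ingredients = [a.text for a in
               root.findall(".//dl[@id='stages']/dd/ul/li/p/a")]

print(title)        # Pancakes
print(heading)      # Ingredients
print(ingredients)  # ['plain flour', 'free-range eggs']
```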

While doing this I've realised that the BBC Food website doesn't actually have that great a layout, but it's still achievable. The problem is that the ingredients aren't structured as [quantity] [measurement] [ingredient]; instead, for example, [3 free-range] [eggs] comes out as [quantity + part of the ingredient name] [rest of the ingredient name].
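One rough way round this would be to take the flattened text and split it with a regular expression. This is just a sketch: the pattern and the unit list are my own assumptions, not anything from the BBC markup, and fractions like "1/2" (or the unicode "½" mentioned above) would need extra handling:

```python
# Rough sketch of recovering [quantity] [measurement] [ingredient]
# from flattened ingredient text; units and pattern are assumptions.
import re

UNITS = r"(?:g|kg|ml|l|tbsp|tsp|oz)"
PATTERN = re.compile(
    r"^(?P<quantity>\d+(?:/\d+)?)\s*"   # e.g. "3", "110", "1/2"
    r"(?P<measurement>%s\b)?\s*"        # optional unit; \b stops "l" eating "lemon"
    r"(?P<ingredient>.+)$" % UNITS)

results = []
for line in ["110g plain flour", "3 free-range eggs"]:
    m = PATTERN.match(line)
    results.append((m.group("quantity"), m.group("measurement"),
                    m.group("ingredient")))

print(results)
# [('110', 'g', 'plain flour'), ('3', None, 'free-range eggs')]
```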

On looking more into the formatting, I've found that there is a microformat standard called hRecipe, which some sites use, and which would make crawling the data much simpler. I'll investigate this further.

Plan To Eat
RealSimple
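A quick sketch of why hRecipe helps: the data is flagged with fixed class names (fn for the recipe name, ingredient for each ingredient), so extraction becomes a class lookup rather than guessing at each site's layout. The snippet below is an invented example, not actual markup from either site, and exact class matching is a simplification (real pages often have several classes in one attribute):

```python
# hRecipe-style markup: extraction keys off standard class names.
import xml.etree.ElementTree as ET

SAMPLE = """
<div class="hrecipe">
  <h1 class="fn">Spaghetti Carbonara</h1>
  <ul>
    <li class="ingredient">200g spaghetti</li>
    <li class="ingredient">2 eggs</li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)
name = root.find(".//*[@class='fn']").text
ingredients = [li.text for li in root.findall(".//*[@class='ingredient']")]

print(name)         # Spaghetti Carbonara
print(ingredients)  # ['200g spaghetti', '2 eggs']
```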

Update 16:22 17/11/2011


I've worked a bit more on the crawler today, and I've started to base it off realsimple.com, as they follow the hRecipe microformat, making data extraction a lot simpler. So far I've been able to crawl one page of the list of American recipes and extract the recipe name from the recipe pages it links to. First I'm going to work on crawling the other pages of the American recipes list, which will involve adapting my rule, then extract the rest of the data from those pages, e.g. the ingredients and recipe steps.
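The rule change described above mostly comes down to widening a URL pattern: a CrawlSpider rule takes a regular expression of links to follow, so going from one list page to all of them means relaxing that expression. The URL shapes below are invented for illustration, not taken from realsimple.com:

```python
# Widening a crawl rule from one list page to all pages of the list.
# URL patterns here are hypothetical, for illustration only.
import re

urls = [
    "http://www.realsimple.com/food-recipes/american/index.html",
    "http://www.realsimple.com/food-recipes/american/index2.html",
    "http://www.realsimple.com/food-recipes/italian/index.html",
]

old_rule = re.compile(r"/american/index\.html$")     # matches one page only
new_rule = re.compile(r"/american/index\d*\.html$")  # matches every list page

matched = [u for u in urls if new_rule.search(u)]
print(matched)  # first two URLs: index.html and index2.html
```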

Crawler v1
