I've been experimenting with Scrapy's XPath selectors in the Scrapy shell. I'm testing them against the BBC Food website to work out how to extract the data correctly, so that I can semi-automate the recipe crawling.
Achieved so far
Get the title of the recipe - hxs.select('//h1/text()')
Get the ingredients h2 - hxs.select('//div[@id="ingredients"]/h2/text()')
Get the volume of ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/text()') (problem with fractions)
Get ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/a/text()')
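To show what these selectors pick up, here is a minimal sketch that runs the same paths against a small hand-written snippet. It uses the standard library's ElementTree rather than Scrapy, so it runs without a crawl. The sample markup and its values are assumptions for illustration, not the real BBC Food page. ElementTree supports only a subset of XPath, so text() is replaced by the .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a recipe page. Only the element
# structure named in the selectors above is taken from the post.
SAMPLE = """
<html><body>
  <h1>Pancakes</h1>
  <div id="ingredients">
    <h2>Ingredients</h2>
    <dl id="stages">
      <dd><ul>
        <li><p>3 free-range <a href="/food/egg">eggs</a></p></li>
        <li><p>100g plain <a href="/food/flour">flour</a></p></li>
      </ul></dd>
    </dl>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Equivalent of //h1/text() and //div[@id="ingredients"]/h2/text()
title = root.find(".//h1").text
heading = root.find(".//div[@id='ingredients']/h2").text

# Equivalent of //dl[@id="stages"]/dd/ul/li/p
paragraphs = root.findall(".//dl[@id='stages']/dd/ul/li/p")

# .text is only the text before the first child element of <p>,
# so the linked part of the ingredient name is missing here.
leading_text = [p.text for p in paragraphs]

# Equivalent of .../p/a/text(): only the linked part.
linked_text = [p.find("a").text for p in paragraphs]

# Joining all descendant text gives the complete ingredient line.
full_lines = ["".join(p.itertext()).strip() for p in paragraphs]

print(title, heading, leading_text, linked_text, full_lines)
```

Joining the descendant text is one way to get a whole ingredient line back in a single string before trying to split it into quantity and name.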
While doing this I've realised that the BBC Food website's markup isn't that well structured, but extraction is still achievable. The problem is that the ingredients aren't marked up as [quantity] [measurement] [ingredient]. For example, "3 free-range eggs" is split as [3 free-range] [eggs], i.e. [quantity + part of the ingredient name] [rest of the ingredient name].
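One possible way around this is to rejoin the text and split it again in code. The sketch below is only a rough idea under a few assumptions: the quantity comes first, fractions appear either as "1/2" or as a single character such as "½", and the unit list is a small made-up sample. The names split_ingredient, parse_quantity and UNITS are hypothetical helpers, not part of Scrapy.

```python
import re
import unicodedata

# Assumed, non-exhaustive unit list, for illustration only.
UNITS = {"g", "kg", "ml", "l", "tsp", "tbsp", "oz", "lb", "cup", "cups"}

# Vulgar fraction characters: U+00BC..U+00BE and U+2150..U+215E.
FRACTIONS = "\u00bc-\u00be\u2150-\u215e"

# A leading quantity such as "3", "1.5", "1/2", "½" or "1½", then the rest.
QTY_RE = re.compile(
    r"^(?P<qty>\d+(?:[./]\d+)?[" + FRACTIONS + r"]?|[" + FRACTIONS + r"])"
    r"\s*(?P<rest>.*)$"
)


def parse_quantity(token):
    """Convert '3', '1.5', '1/2', '½' or '1½' to a float."""
    if "/" in token:
        numerator, denominator = token.split("/", 1)
        return float(numerator) / float(denominator)
    whole, fraction = token, 0.0
    if not token[-1].isascii():
        # unicodedata.numeric maps a character such as '½' to 0.5.
        fraction = unicodedata.numeric(token[-1])
        whole = token[:-1]
    return (float(whole) if whole else 0.0) + fraction


def split_ingredient(line):
    """Return (quantity, unit, name); quantity and unit may be None."""
    line = line.strip()
    match = QTY_RE.match(line)
    if not match:
        return None, None, line
    quantity = parse_quantity(match.group("qty"))
    words = match.group("rest").split()
    unit = None
    if words and words[0].lower() in UNITS:
        unit = words.pop(0).lower()
    return quantity, unit, " ".join(words)


for text in ["3 free-range eggs", "100g plain flour", "1½ tbsp olive oil"]:
    print(split_ingredient(text))
```

Descriptive words such as "free-range" stay in the name, since there is no reliable way to tell them apart from the ingredient without a word list.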
On looking further into the formatting, I've found a microformat called hRecipe, which some sites use and which would make crawling the data much simpler. I'll investigate this further.
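hRecipe marks up a recipe with class names such as "hrecipe" for the root, "fn" for the recipe name, "ingredient" and "instructions". Here is a minimal sketch of why that helps, using only the standard library's HTMLParser. The sample page is invented, HRecipeParser is a hypothetical helper, and the code assumes reasonably well-formed HTML where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HRecipeParser(HTMLParser):
    """Collect the text of elements carrying selected hRecipe classes."""

    FIELDS = ("fn", "ingredient", "instructions")

    def __init__(self):
        super().__init__()
        self.stack = []    # per open element: the field it starts, or None
        self.buffers = []  # (field, text parts) for fields currently open
        self.data = {field: [] for field in self.FIELDS}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((f for f in self.FIELDS if f in classes), None)
        self.stack.append(field)
        if field:
            self.buffers.append((field, []))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        field = self.stack.pop()
        if field:
            name, parts = self.buffers.pop()
            # Collapse runs of whitespace inside the collected text.
            self.data[name].append(" ".join("".join(parts).split()))

    def handle_data(self, data):
        # Text belongs to every hRecipe field that is currently open.
        for _, parts in self.buffers:
            parts.append(data)


# Invented sample markup using hRecipe class names.
SAMPLE = """
<div class="hrecipe">
  <h1 class="fn">Buttermilk Pancakes</h1>
  <ul>
    <li class="ingredient">2 cups <a href="#">flour</a></li>
    <li class="ingredient">3 eggs</li>
  </ul>
  <div class="instructions">Whisk, then fry.</div>
</div>
"""

parser = HRecipeParser()
parser.feed(SAMPLE)
parser.close()
recipe = parser.data
print(recipe)
```

Because the class names are the same on every site that follows the microformat, the extraction does not depend on each site's particular nesting of tags.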
Plan To Eat
RealSimple
Update 16:22 17/11/2011
I've worked a bit more on the crawler today and have started basing it on realsimple.com, as they follow the hRecipe microformat, which makes data extraction a lot simpler. So far I've been able to crawl one page of the list of American recipes and extract the recipe name from the recipes page. Next I'm going to work on crawling the other pages of the American recipes list, which will involve adapting my rule, and then extract the rest of the data from these pages, e.g. the ingredients and recipe steps.
Crawler v1