Gary Short Honours Project

Wednesday, 16 November 2011

16/11/2011 Scrapy

Today I've been experimenting with Scrapy. It's outwith my projected plan for this week, but I expect to be able to finish my other tasks as well.

I'm experimenting with the XPath selectors in the Scrapy shell, testing them on the BBC Food website to work out how to extract the recipe data correctly so that I can semi-automate the recipe crawling.

Achieved so far

Get the title of the recipe - hxs.select('//h1/text()')
Get the ingredients h2 - hxs.select('//div[@id="ingredients"]/h2/text()')
Get the volume of ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/text()') (problem with fractions)
Get ingredients - hxs.select('//dl[@id="stages"]/dd/ul/li/p/a/text()')

While doing this I've realised that the BBC Food website doesn't have a great layout for crawling, but it's still achievable. The problem is that the ingredients aren't marked up as [quantity] [measurement] [ingredient]. For example, "3 free-range eggs" is split as [3 free-range] [eggs], that is, [quantity + some of the ingredient name] [rest of ingredient name].
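A minimal sketch of one way round this: join the text from the p node and its child a node back into a single line, then split that line into quantity, measurement and ingredient. It assumes the "problem with fractions" is that values such as ½ or 1/2 don't parse as plain numbers. The unit list and the function names are hypothetical, not something taken from the BBC markup.

```python
import re
import unicodedata
from fractions import Fraction

# Hypothetical unit list; a real crawler would need a much longer one.
UNITS = {"g", "kg", "ml", "l", "tsp", "tbsp", "oz", "lb"}


def parse_quantity(token):
    """Turn '3', '1/2', '½' or '1½' into a Fraction, or return None."""
    plain = ""
    total = Fraction(0)
    for ch in token:
        if ch.isascii() and (ch.isdigit() or ch == "/"):
            plain += ch
        elif unicodedata.category(ch) == "No":
            # Vulgar fraction characters such as ½ carry a numeric value.
            total += Fraction(unicodedata.numeric(ch)).limit_denominator(16)
        else:
            return None
    if plain:
        try:
            total += Fraction(plain)
        except (ValueError, ZeroDivisionError):
            return None
    return total if (plain or total) else None


def parse_ingredient(text):
    """Split one ingredient line into (quantity, unit, name)."""
    words = " ".join(text.split()).split(" ")
    quantity = unit = None
    if words[0]:
        # Handle '200g' written without a space.
        match = re.fullmatch(r"([\d/]+)([a-zA-Z]+)", words[0])
        if match and match.group(2).lower() in UNITS:
            words[0:1] = [match.group(1), match.group(2)]
        quantity = parse_quantity(words[0])
    if quantity is not None:
        words = words[1:]
        if words and words[0].lower() in UNITS:
            unit = words[0].lower()
            words = words[1:]
    return quantity, unit, " ".join(words)


# The p text and the a text are joined before parsing.
print(parse_ingredient("3 free-range " + "eggs"))
```

Anything that isn't a recognised quantity or unit stays in the ingredient name, so "free-range" ends up with "eggs" rather than with the quantity.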

On looking further into the formatting I've found a microformat standard called hRecipe, which some sites use and which would make crawling the data much simpler. I'll investigate this further.

Plan To Eat

RealSimple
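To show why hRecipe makes extraction simpler, here is a rough sketch that collects text by hRecipe class name (fn for the recipe name, ingredient, instructions) using only the Python standard library. The sample HTML is made up for illustration and is not copied from any of the sites above, and the parser assumes well-formed markup where every non-void tag is closed.

```python
from html.parser import HTMLParser


class HRecipeParser(HTMLParser):
    """Collect text from elements carrying hRecipe class names."""

    FIELDS = ("fn", "ingredient", "instructions")
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.data = {name: [] for name in self.FIELDS}
        self._stack = []  # one entry per open tag: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((f for f in self.FIELDS if f in classes), None)
        if field:
            self.data[field].append("")
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag not in self.VOID and self._stack:
            self._stack.pop()

    def handle_data(self, text):
        # Credit the text to the innermost enclosing hRecipe field, if any.
        for field in reversed(self._stack):
            if field:
                self.data[field][-1] += text
                break


def parse_hrecipe(html):
    parser = HRecipeParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the page layout.
    return {k: [" ".join(v.split()) for v in vals]
            for k, vals in parser.data.items()}


# Made-up sample page.
SAMPLE = """
<div class="hrecipe">
  <h1 class="fn">Pancakes</h1>
  <ul>
    <li class="ingredient">3 free-range <a href="#">eggs</a></li>
    <li class="ingredient">200g plain flour</li>
  </ul>
  <div class="instructions">
    <p>Whisk.</p>
    <p>Fry.</p>
  </div>
</div>
"""

print(parse_hrecipe(SAMPLE))
```

Because each ingredient sits in its own element with a known class, the whole line comes out in one piece, whatever links are nested inside it.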

Update 16:22 17/11/2011

I've worked a bit more on the crawler today and I've started to base it on realsimple.com, as they follow the hRecipe microformat, which makes data extraction a lot simpler. So far I've been able to crawl one page of the list of American recipes and extract the recipe name from each recipe page. First I'm going to work on crawling the other pages of the American recipes list, which will involve adapting my rule. Then I'll extract the rest of the data from these pages, e.g. the ingredients and recipe steps.

Crawler v1

Thursday, 10 November 2011

This Week 7-11 November

Development Environment

Set up my development environment for my honours project. It is a LAMP setup running on a virtual machine on VirtualBox, using a TurnkeyLinux distro based on Ubuntu 10.04. It's incredibly lightweight and highly customisable, and therefore should be a perfect base for my project.

It's accessible through http://garyshort.dyndns.org

Meeting With Nirmalie

My task for this week is to finish a test of the similarity algorithm. This consists of importing my test data for 50 recipes into a MySQL database, converting the similarity algorithm from last year's coursework from PL/SQL, running it for two recipes and recording the results.
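As a rough illustration of what a test on two recipes looks like, here is a sketch using Jaccard similarity over ingredient sets. This is a stand-in measure chosen for the example, not the coursework algorithm (which isn't described here), and the two recipes are made up.

```python
def jaccard(a, b):
    """Shared ingredients divided by all distinct ingredients."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up test data for two recipes.
pancakes = {"eggs", "flour", "milk"}
omelette = {"eggs", "milk", "cheese", "butter"}

# 2 shared ingredients out of 5 distinct ones.
print(jaccard(pancakes, omelette))
```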

My task for next week is to implement the above in PHP. This consists of creating a PHP script that connects to the MySQL database and is able to run a stored function.

Further Research

A more detailed look into coding a crawler with Scrapy, to allow me to (semi-)automate importing recipes from the web into my MySQL database.

Found a possible base for the site in userCake. It's an open source PHP user management system with a MySQL database backend. I will be doing more research into this but it looks well written and highly customisable.

Test version of userCake - http://garyshort.dyndns.org/

Wednesday, 9 November 2011

Week 4 - Literature (Reasoning)

This week I have been reading through available papers on the ACM Digital Library that come under the category of 'Recipe Recommender'. I have found a few interesting papers in this field.

I've noticed that the recommender systems in these papers focus on recommending recipes based on another recipe, or on recommending healthy recipes in response to increasing unhealthiness worldwide. From reading these papers I have come to learn that my honours project is unique, as the issue I am trying to tackle is the astonishing amount of food waste worldwide. The aim is to reduce it by suggesting recipes that use the ingredients you have lying about, to save them going past their sell-by date and getting thrown away. I also feel that cooking food from scratch, instead of using readily available ready meals which tend to be high in salt and saturated fats, indirectly raises the healthy-eating knowledge of anyone using a recommender system like mine.

Going back to food waste I was astonished by the figures in Scotland alone! See below:

With ever-rising energy prices, food prices and environmental impact, something as simple as reducing food waste can help with all of these issues.

With new health statistics for Scotland published, I was quite frankly scared by what the figures showed. In a developed country like Scotland, the life expectancy for someone from the most deprived area is on average 55, which is similar to that of Ethiopia and Kenya.

I've attached the papers which I've read and highlighted the points I thought were interesting.

Deriving a Recipe Similarity Measure for Recommending Healthful Meals

Designing and Evaluating Kalas: A Social Navigation System for Food Recipes

Intelligent Food Planning: Personalized Recipe Recommendation

Monday, 7 November 2011

Week 3 - Technology Analysis

Still to publish

Wednesday, 2 November 2011

Week 2 - Technology Research

Front End

PHP5, CSS3, JavaScript and HTML5 will be used to create the front end application. I thought it was fundamental to work with the latest technologies.
HTML5 will be based on HTML5 Boilerplate to ensure good standards and optimal compatibility.
The Geolocation feature in HTML5 can be used to pinpoint the closest Tesco store to the user.
JavaScript (JSON) will be used for the Tesco API data.
Possibility of using the PHP library Flourish to speed up development time and make the project more secure.
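For the closest-store idea, a minimal sketch of the distance calculation: given the user's latitude and longitude (as the browser's geolocation would supply) and a list of store positions, pick the store with the smallest great-circle (haversine) distance. The store names and coordinates are made up, and how store positions would actually be obtained is left open.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_store(user, stores):
    """Return the store dict closest to the (lat, lon) tuple `user`."""
    return min(
        stores,
        key=lambda s: haversine_km(user[0], user[1], s["lat"], s["lon"]),
    )


# Made-up stores for illustration only.
STORES = [
    {"name": "Store A", "lat": 57.15, "lon": -2.10},
    {"name": "Store B", "lat": 55.95, "lon": -3.19},
]

print(nearest_store((57.12, -2.14), STORES)["name"])
```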

Back-end

The database back-end will be MySQL 5. After looking at other available database systems such as SQLite and PostgreSQL, it seemed logical to go with MySQL: its functionality is very similar to Oracle's, it performs very well with large amounts of data, it is free, there is a plethora of support documentation online, and I have previous experience with it.
Apache will be used due to its extensibility and support.

PHP Crawler

Design my own PHP crawler to crawl the BBC Food website and collect data. Once the data is collected, either insert it automatically into the MySQL database or export it to an XML/SQL file.
Use the Python crawler framework Scrapy to crawl the BBC Food website and collect recipe data. Export the scraped data feed to XML with Scrapy, then create a PHP XML parser to insert the data into the database.
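A rough sketch of the second option's import step, written in Python rather than PHP and using an in-memory SQLite database as a stand-in for MySQL. It assumes the feed has an items root with one item element per recipe and list fields written as value children. The field names (name, ingredients) and the table layout are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up feed in the assumed layout.
FEED = """<items>
  <item>
    <name>Pancakes</name>
    <ingredients>
      <value>3 free-range eggs</value>
      <value>200g plain flour</value>
    </ingredients>
  </item>
</items>"""


def load_feed(xml_text, conn):
    """Insert every recipe in the feed, with its ingredient lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recipe "
        "(id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingredient "
        "(recipe_id INTEGER REFERENCES recipe(id), line TEXT NOT NULL)"
    )
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        name = (item.findtext("name") or "").strip()
        cur = conn.execute("INSERT INTO recipe (name) VALUES (?)", (name,))
        rows = [
            (cur.lastrowid, (value.text or "").strip())
            for value in item.findall("ingredients/value")
        ]
        conn.executemany(
            "INSERT INTO ingredient (recipe_id, line) VALUES (?, ?)", rows
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_feed(FEED, conn)
print(conn.execute("SELECT name FROM recipe").fetchall())
```

Parameterised queries keep scraped text from being interpreted as SQL, which matters when the data comes from pages outside my control.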

Mobile Application

Use of the PhoneGap API to allow easy porting to the different mobile platforms. This gives access to native features on mobile phones such as geolocation.
Design the desktop site in such a manner that it detects a mobile browser and renders correctly for a mobile device.

Week 1 - Initial Project Specification

I've created my initial project specification based on my initial ideas. I've done some research and found a computer cooking competition carried out by the ICCBR and have brought the ideas and fundamentals of this into my initial project specification.

Initial Project Spec.

Week 1 - Project Details

For my honours project I've decided to create a recipe recommender system to recommend recipes based on the ingredients you have in your kitchen. Below are my aims and objectives for my project.

Aims/Objectives

Conduct a survey of literature and relevant recipe related applications
Develop a crawler to gather recipes and related content such as images and videos.
Design and implement a recommender system using MySQL.
Design and implement the web front-end to gather user queries.
Integrate the Tesco API to allow additional functionality such as formulating a purchase basket consisting of missing ingredients
Design and conduct a small-scale user trial to evaluate the recipe recommender system.
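As a rough sketch of how the recommender and purchase basket objectives fit together: rank each recipe by the fraction of its ingredients already in the kitchen, and keep the missing ones as a candidate basket. The scoring rule, function name and sample data are assumptions for illustration, not the final design.

```python
def rank_recipes(pantry, recipes):
    """Return (coverage, name, missing) tuples, best match first."""
    pantry = set(pantry)
    ranked = []
    for name, ingredients in recipes.items():
        needed = set(ingredients)
        missing = needed - pantry
        coverage = 1 - len(missing) / len(needed) if needed else 0.0
        ranked.append((coverage, name, sorted(missing)))
    # Highest coverage first; ties broken alphabetically by name.
    ranked.sort(key=lambda r: (-r[0], r[1]))
    return ranked


# Made-up recipes and kitchen contents.
RECIPES = {
    "omelette": {"eggs", "milk", "cheese"},
    "pancakes": {"eggs", "flour", "milk"},
    "toast": {"bread"},
}
PANTRY = {"eggs", "milk", "flour"}

for coverage, name, missing in rank_recipes(PANTRY, RECIPES):
    print(name, round(coverage, 2), "missing:", missing)
```

The missing list for the chosen recipe is what would be passed on to build the purchase basket.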

Additional Features

Mobile application with the same functionality as the desktop version.
Geolocation to find the closest Supermarket/Tesco store.
Ability to upload videos/pictures of the recipe.
Recipe adaptability to substitute ingredients for ones that the user has.
Share recipes over social networks.
Data from Tesco API will be used to personalise ingredients being purchased based on a budget.