Introduction
The COVID-19 pandemic accentuated the importance of home-cooking with surveys showing that 73% of consumers say that cooking at home makes them feel accomplished, and 70% of consumers say they will continue cooking at home after the pandemic. Riding on this popularity wave, companies in the broader food industry have found that promoting their products through ‘shoppable recipes’ on online recipe websites has positive returns on actual sales.
By studying a network of recipes, their attributes, and users’ interactions with the recipes on Food.com, currently the 6th largest cooking and recipe website in the U.S., we aim to recommend to our client, a US-based baking ingredient supplier, a suitable user profile that will allow it to maximise impressions and spread awareness of its new product line across the Food.com network.
The business problem was translated into an analytics process. The packages and tools used included NetworkX, Gephi, Matplotlib, NLTK, CDLIB, Gensim, and WordCloud. The key questions and their corresponding data science tasks were:
- Who are the most important recipe contributors? – Top users by PageRank score
- What communities of interest exist within the network? – Community detection on bipartite graphs
- What recipe topics are popular? – Topic Modelling
The Dataset
The dataset was obtained from Kaggle and consists of raw data on 231,637 recipes (Recipe Table) and 226,570 users on Food.com, with 1,132,367 user-recipe interactions (Interaction Table) spanning 18 years from 2000 to 2018. Key information includes recipe contributors, recipe preparation times, tags used, nutrition, steps, users’ ratings, review text, and dates of interactions. For our analyses, we extracted all user-recipe interactions for 4 years (2003, 2008, 2013 and 2018), totalling 270,391 rows.
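The year-based extraction can be sketched with pandas as follows. This is a minimal illustration on a tiny in-memory stand-in for the Interaction Table; the column names (user_id, recipe_id, date, rating) are assumptions about the Kaggle export and are not confirmed by this report.

```python
import pandas as pd

# Tiny stand-in for the Interaction Table; the column names are assumptions
# about the Kaggle export, and the rows are made up for illustration.
interactions = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4, 5],
        "recipe_id": [10, 10, 11, 12, 13],
        "date": ["2003-05-01", "2004-02-11", "2008-07-19", "2013-01-03", "2018-12-24"],
        "rating": [5, 4, 5, 3, 4],
    }
)

YEARS_OF_INTEREST = [2003, 2008, 2013, 2018]

# Parse the date column and keep only rows from the four sampled years.
interactions["date"] = pd.to_datetime(interactions["date"])
interactions["year"] = interactions["date"].dt.year
sampled = interactions[interactions["year"].isin(YEARS_OF_INTEREST)]

# One sub-frame per year, ready to be turned into a yearly graph.
by_year = {year: frame for year, frame in sampled.groupby("year")}
print({year: len(frame) for year, frame in by_year.items()})
```

Keeping one sub-frame per year makes it straightforward to build a separate graph for each of the 4 time periods.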
Exploratory Data Analysis
A preliminary visualisation of the distribution of key fields (e.g. ratings, contributors, steps, and number of ingredients) was done to gain initial intuition for the dataset. It was observed that the popularity of the platform appeared to fall after 2008, with the total number of annual interactions falling year on year. Additionally, plots of recipe count by Contributor ID and user interactions by Contributor ID showed that the majority of contributions and interactions are concentrated in the top few percentiles of contributors and their recipes. This justified focusing subsequent analysis on the Top 10 contributors to reveal the most important properties and characteristics of the network.
Network Influence: PageRank Centrality
To assess the importance of users in the network, we employed PageRank centrality because it accounts for both the quality of a node and its absolute reach. In this scenario, each node represents one user in the community.
Invented by Google founders Larry Page and Sergey Brin, PageRank centrality is a variant of EigenCentrality designed for ranking web content, using hyperlinks between pages as a measure of importance. PageRank’s main difference from EigenCentrality is that it accounts for link direction. Each node in a network is assigned a score based on its number of incoming links (its ‘indegree’). These links are also weighted by the relative score of their originating node. The result is that nodes with many incoming links are influential, and nodes to which they are connected share some of that influence.
In our scenario, a user is deemed important when he/she has contributed significantly to the platform, with many other users interacting with his/her recipes (high in-degree). Additionally, interactions coming from other important users should reflect the quality of his/her contributions and hence increase his/her importance in the network. The calculations for the 4 years were performed with NetworkX’s built-in pagerank method, with `weight` set to the number of interactions (n_interactions), to obtain the list of top-10 PageRank users in each year for further analysis.
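As a minimal sketch of this step, the snippet below builds a small User-User directed graph and ranks users by weighted PageRank. The edge list is made up, and the edge direction (from the interacting user to the recipe contributor) is an assumption about how the graph was constructed; only `nx.pagerank` and the `n_interactions` weight come from the description above.

```python
import networkx as nx

# Hypothetical (reviewer, contributor, n_interactions) triples for one year:
# an edge u -> v means user u interacted with recipes contributed by user v.
edges = [
    ("u1", "chef_a", 5),
    ("u2", "chef_a", 3),
    ("u3", "chef_a", 1),
    ("u3", "chef_b", 2),
    ("chef_a", "chef_b", 4),
    ("u4", "chef_c", 1),
]

G = nx.DiGraph()
G.add_weighted_edges_from(edges, weight="n_interactions")

# Weighted PageRank: incoming interactions count for more when they are
# numerous and when they come from users who are themselves highly ranked.
scores = nx.pagerank(G, weight="n_interactions")

# Rank users by score and keep the top 10 (fewer here, as the toy graph is small).
top_users = sorted(scores, key=scores.get, reverse=True)[:10]
for user in top_users:
    print(f"{user}: {scores[user]:.3f}")
```

In this toy graph chef_b ranks above chef_a despite having fewer direct in-links, because one of its in-links comes from chef_a, which is itself highly ranked. This is the "interactions from important users" effect described above.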
Applying NetworkX’s built-in PageRank function to the 4 User-User directed graphs and ranking the scores gave the following User IDs for the top 10 contributors in each year, determined by the number and weights of the interactions they gathered.
- Year 2003: [37449, 21752, 10404, 27643, 4470, 37779, 52282, 37305, 58104, 27783]
- Year 2008: [37449, 89831, 37779, 37636, 424680, 169430, 383346, 8688, 58104, 166642]
- Year 2013: [37449, 89831, 37779, 37636, 169430, 107583, 24386, 58104, 37305, 386849]
- Year 2018: [89831, 204024, 24386, 65502, 33186, 266635, 47559, 482376, 1762, 2002009640]
From this list, the following contributors were observed to maintain their importance and influence in 3 out of the 4 years across the 18-year period: [37449, 37779, 58104, 89831]. This could reflect an underlying strength in recipe quality that keeps users coming back to interact. Hence, we decided to study these contributors further via ego-networks.
Ego Networks
Extending from the network influence analysis, we studied the ego networks of selected contributors in the Top 10 highest PageRank scores for each of the 4 years of interest. As there were no users who were in the Top 10 for all 4 time periods, 4 contributors that featured in 3 out of the 4 years were chosen for deep dives. Descriptive analysis was done for egos’ recipe ratings, number of ingredients, time, steps and topics. Further, three evaluation measures were used to holistically understand the nature of the ego-networks:
- Degree Centrality: An ego with a larger number of edges has a larger reach. New users joining the network are more likely to connect with this ego by the principle of preferential attachment. Since the client is interested in maximising impressions, the absolute number of immediate neighbours was used instead of a normalised score.
- Similarity Ratio: Ego networks should intuitively be more homogeneous within themselves compared with the population at large. The ego’s neighbours should be more similar to the ego than any other random node in the network which would make recommendations to members in the group more effective. A ‘Similarity Ratio’ defined by (SimRank score of ego network / SimRank score of entire network) was derived to compare ego-networks across the years, using the ego as the source node for comparison with every other node. The higher the similarity ratio, the more similar the ego-network’s members are to the ego and one another than to the rest of the network.
- Efficiency: This measure is defined by (Effective size of ego-network / Actual size of ego-network). An ego with a higher efficiency has reach to more different groups within the larger network, whereas an ego with lower efficiency (or smaller effective size) would be able to cascade information more quickly as it has more cliques within its ego-network.
NetworkX’s ego_graph, effective_size, and simrank_similarity methods were used to perform the analysis on each yearly subset of the main dataframe, with appropriate filtering for a minimum floor of interactions between the users (i.e. n>2), and radius, where radius = 1 for an ego network showing immediate 1st degree neighbours only.
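The three measures can be sketched with the NetworkX functions named above. This is an illustrative reading rather than the exact pipeline: the helper name `ego_metrics`, the toy graph, the conversion to an undirected graph, and the interpretation of the similarity ratio (mean SimRank from the ego to its neighbours, divided by mean SimRank from the ego to every other node) are all assumptions.

```python
import networkx as nx

def ego_metrics(G, ego, min_interactions=3, weight="n_interactions"):
    """Return neighbour count, similarity ratio, effective size and efficiency.

    Assumes the ego keeps at least one neighbour after filtering.
    """
    # Keep only edges that meet the interaction floor (n > 2 in the text),
    # and treat them as undirected ties for the ego-network measures.
    H = nx.Graph()
    H.add_edges_from(
        (u, v) for u, v, w in G.edges(data=weight, default=0) if w >= min_interactions
    )

    # Radius-1 ego network: the ego plus its immediate neighbours.
    ego_net = nx.ego_graph(H, ego, radius=1)
    neighbours = [n for n in ego_net if n != ego]

    # Assumed reading of the similarity ratio: mean SimRank between the ego
    # and its neighbours, over mean SimRank between the ego and all others.
    sim = nx.simrank_similarity(H, source=ego)
    others = [n for n in H if n != ego]
    mean_ego = sum(sim[n] for n in neighbours) / len(neighbours)
    mean_all = sum(sim[n] for n in others) / len(others)

    # Burt's effective size of the ego, and efficiency = effective / actual.
    effective = nx.effective_size(H, nodes=[ego])[ego]
    return {
        "neighbours": len(neighbours),
        "similarity_ratio": mean_ego / mean_all if mean_all else float("nan"),
        "effective_size": effective,
        "efficiency": effective / len(neighbours),
    }

# Toy directed graph with made-up interaction counts.
G = nx.DiGraph()
G.add_weighted_edges_from(
    [
        ("a", "ego", 5), ("b", "ego", 4), ("c", "ego", 3), ("d", "ego", 6),
        ("a", "b", 3),    # two of the ego's neighbours also interact
        ("d", "x", 4), ("x", "y", 3),
        ("z", "ego", 1),  # below the floor, so dropped
    ],
    weight="n_interactions",
)
metrics = ego_metrics(G, "ego")
print(metrics)
```

With four neighbours and a single tie among them, the effective size is 3.5 and the efficiency 0.875. An ego whose neighbours are not connected to one another at all would reach an efficiency of 1.0.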
For visualisation, the graphs were exported in .gexf format and filters similar to those above were applied to produce the ego networks. The Yifan Hu layout algorithm was used alongside a gradient colouring scheme for the nodes. This colouring scheme was relative to the PageRank score, with more intense colours indicating a higher score. The thickness of each directed edge was relative to its edge weight, with thicker arrows indicating higher weights. Only graphs for 2008 and 2013 are included in this report, as these are the 2 years in which all 4 top nodes appeared.
Ego-networks of selected top nodes in 2003
Contributor | No. of neighbours | Similarity ratio | Effective size | Efficiency |
37449 | 48 | 2.60 | 35.7 | 0.74 |
37779 | 117 | 2.87 | 113.43 | 0.97 |
58104 | 48 | 2.67 | 35.38 | 0.71 |
Of the 3 top nodes in 2003, Contributor 37779 had the highest in-degree centrality, similarity ratio and efficiency.
Ego-networks of selected top nodes/ contributors in 2008
2008 was the first year in which all 4 nodes of interest were in the Top 10. Overall, the degree centrality of all nodes was higher than in 2003. 89831 was a new entrant but became the largest hub, with the highest degree centrality. It had a distinct network structure compared with the other nodes of interest, comprising many in-links that did not appear to be interconnected. This observation is borne out by its high efficiency score, indicating that it was potentially a node connecting diverse groups. However, it was 37779 that had a much higher similarity ratio (8.15) than the rest, indicating that the members of this ego-network are highly similar to one another. The large jump in similarity ratio (from 2.87 in 2003) also suggests that 37779’s followers became more homogeneous and closer-knit between 2003 and 2008. Finally, the more intensely coloured edges around 37449 show that this node is connected to more important neighbours.
Ego-networks of selected top nodes/ contributors in 2013
Generally, the size and structure of the ego-networks changed over the 4 time periods in tandem with the overall network. Possibly due to the large drop in the total number of interactions in the network, degree centrality for these 4 top nodes dropped drastically from 2008: the largest hub in 2013, 37449, had only 72 neighbours, compared with 482 neighbours for the largest hub in 2008. Contributor 37779 maintained a high similarity ratio of 8.32, indicating enduring engagement with this contributor’s community of users. Although 89831’s number of neighbours fell by a factor of 10 from 2008, this contributor would still be a useful hub for advertisers to reach more diverse groups in the network, owing to a very high efficiency score.
Ego-networks of selected top nodes in 2018
Contributor | No. of neighbours | Similarity ratio | Effective size | Efficiency |
89831 | 4 | 1.90 | 4.0 | 1.0 |
47559 | 12 | 2.47 | 11.33 | 0.94 |
Only contributor 89831 featured in the Top 10 in 2018 out of the 4 nodes of interest. Moreover, 89831’s ego network was very small, with only 4 neighbours. This was not too different from 2018’s largest hub, 47559, which appeared in the Top 10 list for the first time and had just 12 neighbours.
Community Detection
We applied community detection algorithms to (i) help our client identify and select the communities with characteristics that fit their advertising goal and (ii) achieve better outreach to the targeted segment. To these ends, we performed community detection on the unweighted, undirected user-recipe bipartite graph (BG).
Three community detection algorithms from the CDLIB package, namely greedy modularity, label propagation (LPA), and Louvain, were used and the resulting community assignments were evaluated. As several research studies have shown, modularity maximisation algorithms are prone to the resolution limit problem and may fail to identify communities smaller than a certain threshold, which depends on the size of the network and the degree of interconnectedness between the communities. To mitigate this problem, 10 resolution parameters from 0.6 to 1.5, in steps of 0.1, were tested with the Louvain algorithm to find the community assignment with the highest modularity score. In addition to the modularity score, the z-modularity score, a variant of standard modularity proposed to avoid the resolution limit, was used to supplement the evaluation of community assignment quality.
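The resolution sweep can be sketched as below. To keep the example self-contained it uses NetworkX's own Louvain implementation instead of CDLIB, so it is a stand-in for the analysis rather than a reproduction of it. The interaction pairs, the "u"/"r" node prefixes and the fixed seed are illustrative assumptions, and z-modularity is not computed here.

```python
import networkx as nx

# Hypothetical (user_id, recipe_id) interaction pairs for one year.
pairs = [(1, 10), (2, 10), (1, 11), (2, 11), (3, 12), (4, 12), (3, 13), (4, 13), (2, 12)]

# User and recipe IDs can collide, so prefix them to keep the two node sets apart.
B = nx.Graph()
for user_id, recipe_id in pairs:
    B.add_node(f"u{user_id}", bipartite=0)
    B.add_node(f"r{recipe_id}", bipartite=1)
    B.add_edge(f"u{user_id}", f"r{recipe_id}")  # unweighted, undirected

# Sweep the resolution parameter from 0.6 to 1.5 in steps of 0.1 and keep the
# partition with the highest standard modularity (scored at resolution 1).
best = None
for step in range(6, 16):
    r = step / 10
    parts = nx.community.louvain_communities(B, resolution=r, seed=42)
    q = nx.community.modularity(B, parts)
    if best is None or q > best[0]:
        best = (q, r, parts)

best_q, best_r, best_parts = best
print(f"resolution={best_r}, modularity={best_q:.3f}, communities={len(best_parts)}")
```

Scoring every partition with the same standard modularity keeps the comparison across resolution values on a common scale. Higher resolution values tend to favour smaller communities, which is why the sweep helps with the resolution limit.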
To shortlist communities for deeper exploration, we used the power-law distribution of community sizes to identify the number of communities that together account for the majority (60%) of nodes in each year.
Algorithm | Year | No. of communities | Modularity | Z-modularity |
Greedy modularity | 2003 | 15 | 0.38 | 1.13 |
Greedy modularity | 2008 | 23 | 0.37 | 0.87 |
Greedy modularity | 2013 | 29 | 0.64 | 1.86 |
Greedy modularity | 2018 | 38 | 0.84 | 2.94 |
LPA | 2003 | 10 | 0.006 | 0.062 |
LPA | 2008 | 31 | 0.003 | 0.039 |
LPA | 2013 | 294 | 0.532 | 1.63 |
LPA | 2018 | 65 | 0.699 | 4.43 |
Louvain | 2003 | 23 (r = 1.2) | 0.38 | 1.524 |
Louvain | 2008 | 27 (r = 1.2) | 0.39 | 1.11 |
Louvain | 2013 | 29 (r = 1.0) | 0.64 | 2.04 |
Louvain | 2018 | 37 (r = 1.0) | 0.85 | 2.83 |
Based on the modularity and z-modularity scores, the results from the Louvain algorithm (with a different resolution parameter, r, for each year of analysis) were selected, as it produced better community assignments than greedy modularity and LPA. The results are summarised in Table 2. However, given that LPA does not optimise the modularity score, basing the evaluation of community assignment solely on these scores might favour modularity optimisation algorithms. Therefore, we also examined the community assignments by each algorithm manually to evaluate their accuracy and usefulness. The community assignment of LPA was confirmed to be unsatisfactory as it consisted of many singleton communities.
In addition, it was observed that the number of communities detected increased over the years. Upon further inspection, we noted that the number of interactions on the platform decreased significantly after 2008, potentially due to a drop in platform popularity, which resulted in an increasing number of smaller communities in more recent years.
Plotting the number of nodes in each community, we noted that the distribution follows a power law. We applied a threshold of 60% to yield the shortlisted number of communities for each year, as illustrated in Figure 4. The top 5 communities of 2008 were selected for a deep dive, as that year had the largest number of interactions.
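The 60% coverage shortlist can be expressed in a few lines. This is a sketch under assumptions: the function name and the community sizes are hypothetical, and the rule used (take the largest communities until their combined node count reaches the threshold) is one plausible reading of the procedure.

```python
def shortlist_by_coverage(community_sizes, threshold=0.6):
    """Return the largest communities that together cover `threshold` of all nodes."""
    total = sum(community_sizes.values())
    ranked = sorted(community_sizes.items(), key=lambda item: item[1], reverse=True)

    shortlisted, covered = [], 0
    for community_id, size in ranked:
        # Stop as soon as the communities already chosen reach the threshold.
        if covered >= threshold * total:
            break
        shortlisted.append(community_id)
        covered += size
    return shortlisted

# Hypothetical community sizes with a heavy-tailed distribution.
sizes = {"c1": 400, "c2": 250, "c3": 120, "c4": 80, "c5": 50, "c6": 40, "c7": 30, "c8": 30}
print(shortlist_by_coverage(sizes))
```

Because community sizes are heavy-tailed, a small number of large communities is usually enough to reach the threshold, which keeps the deep dive manageable.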
Theme | Community 1 “Healthier” | Community 2 “Meat” | Community 3 “Bread” | Community 4 “Party” | Community 5 “Desserts/ Sweets” |
Top Contributors | 37449 | 89831, 37779 | 89831, 58104 | 89831, 37779, 37449 | 89831, 37779 |
Name | Chicken, Potato, Salad | Chicken, Cheese, | Chicken, Garlic, Bread | Chicken, Beef, Bean, Butter | Chocolate, Baked, Cream |
Ingredients | Oil, Cheese | Butter, Cheese, Flour, Oil | Butter, Cheese, Powder, Flour | Cream, Water | Cream, Milk, Baking |
Tags | Healthy, Fat, Carb | Meat, Comfort | Meat, Kid-friendly | Holiday, Event | Desserts, Comfort |
No. of steps | 7 | 8 | 8 | 9 | 9 |
No. of ingredients | 8 | 8 | 9 | 9 | 9 |
Minutes | 25 | 40 | 40 | 40 | 45 |
Avg. Rating | 4.7 | 4.5 | 4.5 | 4.4 | 4.5 |
Nutrition | Low calories, fat, sodium, carbs | High calories, fat | High calories, protein | High calories, sodium | High sugar, carbs |
Topic Modelling
For the year 2008, the top topics for each of the top 5 communities were compared with the themes assigned from the descriptive analysis in Community Detection. These aligned well across all communities, with the exception of Community 1, which was earlier assigned the theme “Healthier”. This runs counter to the topic model, which characterised it as “BBQ”, a type of food not usually perceived as healthy. This could be due to recipe contributors tagging their foods as “Healthy” in order to garner more views. Topic modelling thus provided a more accurate insight into the types of food preferred by each community.
Topic Modelling based on community
Topic Modelling based on Top Nodes/ Contributors
Topic modelling was also applied to the recipes published by the top 4 contributors: 37449, 37779, 58104 and 89831.
Recommendations
This project can bring value to aggregator platforms such as Food.com as it has developed an evidence-based analytical process for helping clients select the right profile of influencers for their marketing campaigns. Such a process can be applied to a variety of different contexts and time periods.
Given our client’s profile as a US-based baking ingredient supplier, we recommend that the client engage influencers with profiles like Contributor 89831 for the following reasons:
- Such users have a wide reach and a more diverse audience, as evidenced by 89831 having the highest efficiency of all the top nodes studied.
- Such users are top contributors in several different communities that were highly relevant to our client’s line of products. Communities 3, 4 and 5 were related to bread, snacks and sweet foods respectively, and all have the potential to incorporate our client’s product into their recipes. Users in these communities will also generally be more interested in such recipes.
- Topic modelling results showed that such users have the skillsets to create recipes relevant to our client’s products, having contributed recipes involving cakes, cookies, chocolates and baking.
Future Work
There is potential to extend this project in several ways:
- First, we can develop a recommender system for recipes by building a collaborative filtering model based on users’ similarity within the same communities.
- Next, if additional UX/UI data can be collected from user interactions with the Food.com website, we would like to obtain information such as click-through rates and interactions with other media, such as photos, to enrich our network attributes.
- Finally, we would like to apply the model to more recent data to study the impact of COVID-19 on user interactions on the site and whether preferences changed. The current dataset allowed us to examine patterns in the network without the ‘spikes’ or short-term ‘fads’ of the 2-year pandemic period. Based on our research, our hypothesis is that preferences during and after COVID-19 may lean towards more complicated, but healthier, plant-based options, since the lockdowns have inspired more people to cook from scratch and be more mindful of what they eat.