Creating a product tagger PoC

6 min readNov 29, 2023

If you have read some of my earlier articles, you might already know, that I am really obsessed with good data. The reason is simple: even if you make a basic search, it is a lot easier to do with good data, than optimising to be good despite of the data.

If you try something more complex with bad data (e.g. personalising), you will realise that if you put garbage in, you get garbage out, so it might be good to work on that data upfront. However, good data is rarely free, you might need to invest a lot in that, especially if you are doing it manually. I have been thinking a lot recently, how this can be automatized with ChatGPT and I will show you a potential solution for adding tags to products. (Tags can be for instance suitable diets for a food, memory for a phone, industries for a report etc.) Please note that it is not out of the box, but you can use the thought process for making your own tagger.

If you want to try this PoC on your own, set up a paid ChatGPT account, load there a few bucks (you really don’t need much) and go to https://platform.openai.com/playground/ Don’t set up there assistant for now, just go with chat. Add the prompt described below to system and the input (product data) to user. Use the GPT-4 model as it is far more accurate for the task than GPT-3.5

Setting the scope

First of all, you need to set the scope: for what products what tags you want to add. As an example for tagging I use foods now. This is a good example as:

tags are important here (bio, diets etc.)
the input data is limited — if you have too much input data, the cost increases and the chance for wrong output is increasing.

Let’s set now a minimum scope and check whether the product is:

vegan
vegetarian
gluten-free
sugar-free
lactose-free = false

There are several other diets and also allergenes, but let’s stick now to a PoC. I would recommend you to do the same when you start your tagging project.

Selecting the data

Next to select is what input data to choose? As an example, let’s go with Ocado — a British egrocery — and see what data they have for their products? On their website we see the following fields, which might be relevant for our case:

Product title
Product description
Nutritional values
Ingredients
Allergenes
Dietary information
Brand
Categories → Dietary & Lifestyle
It is not visible easily, but probably in the backend you should also have category of the product, e.g. “bread / sliced”

You might think that the more data you have, the better the outcome is. However, it is not true, too much data can increase the chance for errors, so you should try to select only the data that really helps you. Let’s check again the fields we had:

Product title: I recommend using it. ChatGPT has some historic data, so just based on the title it might be able to do the work we want.
Product description: Avoid it. It can cause lots of hallucination. E.g. if it is mentioned that a cookie is good with milk, ChatGPT might take it as if the milk would be an ingredient. Also, as this is usually long, it might increase your costs.
Nutritional values: In some cases, it might be useful, but not for the current purpose.
Ingredients: Pretty important for our purpose. If you have this inaccurate, the quality of the tags will be worse, but as said at the title: sometimes ChatGPT will be able to add you good tags for known products — just check the accuracy.
Allergenes: Yes, please. This is the same as ingredients.
Dietary information: In case of Ocado, this might have content as vegan, vegetarian etc. If you are confident that this data is good, maybe some machine learning project is better for you. As we are doing this excercise to make good data better, avoid using it.
Brand: Some brands are vegan etc. So, in some cases, it might help. I would try adding it in a bigger project, but for now, let’s keep things simple and exclude it.
Categories → Dietary & Lifestyle: Similar to dietary information, avoid it now.
It is not visible easily, but probably in the backend you should also have category of the product, e.g. “bread / sliced”: It might help if you have good gluten-free, lactose-free etc. categorisation, but for simplicity, let’s eclude it now.

So, we have: title, ingredients and allergenes fields.

Initial assistant setup

When you are setting up an assistant, I recommend you to give it some context about the task:

You help to our egrocery site to tag products by diets.

Tell about the data — including some limitations of the data:

Your answer is based on the product name, ingredients and allergenes given by the client. Please note that allergenes and ingredients might be imperfect.

Describe the task:

Answer the following questions with True or False. If you are not sure, answer False. (Note: You can also ask N/A or similar for unsure.)

Is this product vegan?
Is this product vegetarian?
Is this product gluten-free?
Is this product sugar-free?
Is this product lactose-free?

Tell about the desired about:

Format your answer as:
vegan = False
lactose-free = True
Do not add any other text, only the text asked above.

It is pretty important to add strict definition of the about, so that you can process it. You can also give a json etc. output format, but in this case you can afford to do post processing, so I would recommend to go with something like above due to simplicity and lower costs.

You might also want to add the input format, but the prompt should work now without it.

Refining

After doing the initial setup above, you should try your initial tagger in practice. Add product details to system, I selected:

“Product name:” “M&S Lemon Cheesecake Slices”; “Ingredients:” “Full Fat Soft Cheese (Milk) (23%), Whipping Cream (Milk) (15%), Wheatflour contains Gluten (with Wheatflour, Calcium Carbonate, Iron, Niacin, Thiamin), Sugar, Lemon Curd (7%) (Water, Sugar Pasteurised Egg, Unsalted Butter (Milk), Concentrated Lemon Juice, Cornflour, Lemon Oil, Colour: Lutein), Dextrose, Lemon Juice, Unsalted Butter (Milk), Pasteurised Egg, Palm Oil, Palm Kernel Oil, Concentrated Lemon Juice, Cornflour, Invert Sugar Syrup, Demerara Sugar, Rapeseed Oil, Chicory Fibre, Gelling Agent: Pectin (from Fruit), Acidity Regulator: E331, Potato Starch, Salt, Raising Agent: Sodium Bicarbonate, Stabiliser: E417, Colour: Carotenes, Thickener: Xanthan Gum”; “Allergenes”: “Contains Cereals Containing Gluten, Contains Eggs, Contains Milk, May Contain Nuts, May Contain Peanuts, Contains Wheat.”

You can use a different format, just make sure that ChatGPT can distinguish between the different fields.

The output will be:

vegan = False
vegetarian = True
gluten-free = False
sugar-free = False
lactose-free = False

All correct, right? However, if you run many examples (do not forget to clear history between them), you might see gluten-free, sugar-free and lactose-free running well, but vegan and vegetarian can be a bit more challenging. After you see there the common errors, you might want to define better what does vegan and vegetarian mean and replace those with the following:

Is this product vegan? (A vegan product has no animal-based ingredients. This includes meat, honey, fish and seafood, eggs, dairies etc. Carmine and shellac- otherwise known as E120 and E904 are also not vegan.)
Is this product vegetarian? (Besides other ingredients, fish and seafood, carmine and shellac- otherwise known as E120 and E904 are not vegetarian.)

Why the changes were done that way? I realised that in many cases honey and seafood were not recognised as dealbreakers for vegan food. Also, carmine (an insect-based coloring ingredient) and shellac (an insect-based additive making chocolate shiny) are also ignored. I needed to add the E numbers of them as well as ChatGPT did not know them by default.

Ideally, you should have a “golden set” of products with perfect tags and you could validate your tagger on them and see the weak spots of your setup. If that is not the case, you will need to do your validation manually.

Conclusion

Overall, while this simple PoC does not have a full accuracy, as I have seen, for many cases it is better than what we have in most egrocery sites. You might use it as a starter, adapt it to your system — even if it is outside the egrocery domain — and scale it up as a proper automated solution. With the recent drop on ChatGPT prices, it might be a cheaper, more accurate and longer-term solution for your data issues than using human workforce.