On e-commerce sites like Amazon, there may be millions of different products to choose from. For instance, a customer on Amazon could type in "casual shirt for summer"
and receive a list of available shirts. A system has been set up in the background that retrieves images related to the user’s query, ranks them according to their similarity to the query, and then shows them to the user.
In this post, I describe how to create a search system that works like that of e-commerce platforms as described above.
Basic requirements
To summarize the problem statement, we are designing an image search system that retrieves images similar to the user query, ranks them based on their similarities to the text query, and then displays them to the user. For simplicity, we make some basic assumptions which include:
- Users can only input text queries (images or video queries are prohibited).
- We do not need to personalize the result of the search system.
- We have a dataset of, say, one million <text query, image> pairs for model training.
- The model uses both image metadata and image pixels.
- Users can only click on an image; we can construct training data online and label it based on user interactions.
1. Overview of the search system
This kind of problem is known as a ranking problem. The goal of ranking problems is to rank a collection of items such as images, websites, products, etc., based on their relevance to a query, so that more relevant items appear higher in the search results. Many ML applications, such as recommendation systems, search engines, document retrieval, and online advertising, can be framed as ranking problems.
To determine the relevance between an image and a text query, we utilize both image visual content and the image’s textual data. An overview of the design can be seen in Figure 1.3.
1.1 Text search
Images whose titles, descriptions, or tags are most similar to the text query are displayed as the results of the text search. Figure 1.4 demonstrates how text search works when a user enters a query such as “casual shirts for summer”. Using full-text search in a database is one way to find the image descriptions that best match the text query.
Full-text search (FTS) is the technique of searching for a portion of text within a large collection of electronically stored text data and returning results that contain some or all of the terms in the search [2]. It can also retrieve documents that do not perfectly match the search criteria. Documents here refer to database entities containing textual data. This technique is not based on machine learning, so it is fast and involves no training cost. Many companies use search engines such as Elasticsearch [4] and MongoDB Atlas Search [2] to return results for text queries against their catalogues, as seen in Figure 1.4.
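To make this concrete, here is a minimal sketch of a full-text query using the Elasticsearch Python client (8.x style). The index name, field names, and local cluster URL are assumptions for illustration, not part of the original design.

```python
# Minimal full-text search sketch against a hypothetical "products" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

response = es.search(
    index="products",  # hypothetical index of catalogue items
    query={
        "multi_match": {
            "query": "casual shirts for summer",
            "fields": ["title", "description", "tags"],  # assumed metadata fields
        }
    },
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```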
1.2. Image visual search
This component outputs a list of images after receiving a text query as input. Based on how closely the text query and the image resemble each other, the images are ranked. Representation learning is a method that is frequently used to do this. In representation learning, a model is trained to turn data like images and texts into representations referred to as embeddings.
In this method, the text query and the images are encoded separately by two encoders. This matters because computers only understand numerical data, so both text and images must be translated into numerical representations known as embeddings. We first apply an image encoder to every product image currently in the catalogue to produce an embedding vector for each one. As Figure 1.5 illustrates, similar images are mapped to points close to each other in the embedding space. A text encoder then produces an embedding vector from the text query. Finally, we take the dot product of the two representations to compute a similarity score between the image and the text.
We compute the dot product between the text and each image in the embedding space, then rank the images according to their similarity scores to determine which images are most visually and semantically comparable to the text query.
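A small sketch of this scoring-and-ranking step, using random vectors as stand-ins for real embeddings (the embedding dimension and catalogue size are arbitrary):

```python
import numpy as np

# Hypothetical embeddings: one text query vector and N catalogue image vectors.
text_embedding = np.random.rand(512)             # shape: (d,)
image_embeddings = np.random.rand(10_000, 512)   # shape: (N, d)

# Dot product between the query and every image embedding gives similarity scores.
scores = image_embeddings @ text_embedding       # shape: (N,)

# Rank images from most to least similar and keep the top k.
k = 10
top_k_indices = np.argsort(-scores)[:k]
print(top_k_indices, scores[top_k_indices])
```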
2. Data Preparation for Training
2.1. Data engineering
We assume we are given an annotated dataset to train and evaluate the model. We could have users, images and user-image interaction data.
2.1.1. Images
The system stores catalogue items and their metadata. The table below shows a simplified example of image metadata:

| Image ID | Upload time | Manual tags |
|----------|-------------|-------------|
| 10 | 105743135 | cotton, polo |
| 193 | 1958351341 | long-sleeve, flannel, polyester |
2.1.2. User-image interactions data
Interaction data captures many kinds of user interactions. This kind of data can reveal the types of queries users enter and how they interact with the results. Clicks, impressions, and purchases (conversion, add-to-cart, etc.) are the three main types of interactions, although they can be noisy. The table below shows a condensed example of user-image interaction data, which we discuss further below.
| User ID | Text query | Displayed image ID | Interaction type | Timestamp |
|---------|------------|--------------------|-------------------|------------|
| 10 | White cotton polo | 9 | Click | 1658451365 |
| 193 | Men's summer shirt | 15 | Click | 1648452360 |
2.2 Feature engineering
Nearly all algorithms for machine learning only accept numeric input values. During this step, unstructured data like texts and images must be transformed into a numerical representation. We outline the process for getting the text and image data ready for the model.
2.2.1. Preparing the text data
As shown in Figure 1.7, text is typically represented as a numerical vector using three steps: text tokenisation, text normalisation, and converting tokens to IDs [5], which we will not cover in detail here.
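Although the details are out of scope for this post, a toy sketch makes the three steps concrete. The vocabulary and the naive whitespace tokenisation are assumptions for illustration only; real systems use trained tokenizers and much larger vocabularies.

```python
# Toy text preparation: tokenise -> normalise -> map tokens to IDs.
def tokenise(text: str) -> list[str]:
    return text.split()  # naive whitespace tokenisation

def normalise(tokens: list[str]) -> list[str]:
    return [t.lower().strip(".,!?") for t in tokens]

vocab = {"<unk>": 0, "casual": 1, "shirts": 2, "for": 3, "summer": 4}  # assumed vocabulary

def to_ids(tokens: list[str]) -> list[int]:
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

print(to_ids(normalise(tokenise("Casual shirts for summer"))))  # [1, 2, 3, 4]
```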
2.2.2. Preparing image data
As explained in Section 1, the goal is to design an image search system that retrieves images similar to a user’s text query, ranks them based on their similarity to the query, and then displays them to the user. Since catalogue images are fed to the image encoder, we need to preprocess them. The most common image preprocessing operations are [1] (a minimal sketch follows the list):
- Resizing: Models usually require fixed image sizes (e.g. 224×224)
- Scaling: Scale pixel values of each image to the range [0, 1]
- Z-score standardization: Scale pixel values to have a mean of 0 and variance of 1
- Consistent colour mode: Ensure images have a consistent colour mode (e.g. RGB or CMYK)
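A minimal preprocessing sketch using PIL and torchvision. The file path is hypothetical, and the mean/std values are the commonly used ImageNet statistics rather than values prescribed by the original design.

```python
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                     # fixed input size
    transforms.ToTensor(),                             # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # z-score standardisation
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("catalogue/shirt_001.jpg").convert("RGB")  # consistent colour mode
tensor = preprocess(image)  # shape: (3, 224, 224)
```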
3. Model Development
3.1. Embedding Model Selection
As discussed in the Overview of the search system and visualised in Figure 1.6, text queries are converted into embeddings by a text encoder, and images are converted into embeddings by an image encoder. In this section, we examine possible model architectures for each encoder.
3.1.1 Text encoder
A typical text encoder’s input and output are shown in Figure 1.8.
The text encoder converts text into a vector representation. For example, if two separate sentences have similar meanings, their embeddings are close to each other in the embedding space. Two broad categories of methods are widely used to build the text encoder: statistical methods and machine learning-based methods.
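As one example of an ML-based text encoder, here is a sketch using an off-the-shelf sentence-transformers model. The specific model name is an assumption for illustration; a statistical method such as TF-IDF, or a custom-trained encoder, could be used instead.

```python
from sentence_transformers import SentenceTransformer

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed pre-trained model
embeddings = text_encoder.encode([
    "casual shirt for summer",
    "lightweight summer shirt for men",
    "winter down jacket",
])
print(embeddings.shape)  # (3, 384) for this particular model
```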
3.1.2 Image encoder
For encoding images, we propose using neural networks because they have proven effective at handling texts and images and can produce the embeddings needed for representation learning. A well-known CNN-based architecture is the residual network, or ResNet [9], whose deepest variant has an impressive 152 layers. The key to the model design is the idea of residual blocks, which make use of shortcut connections. These are simply connections in the network architecture where the input is kept as-is (not weighted) and passed on to a deeper layer, e.g. skipping the next layer.
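A sketch of an image encoder built from a pre-trained ResNet by dropping the final classification layer, so that the network outputs an embedding vector per image. Using torchvision’s ResNet-152 here is an assumption; in practice the encoder would be fine-tuned during contrastive training.

```python
import torch
import torchvision.models as models

resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)  # pre-trained weights
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-1])   # remove the fc layer
image_encoder.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)           # stand-in for preprocessed images
    embeddings = image_encoder(batch).flatten(1)  # shape: (4, 2048)
print(embeddings.shape)
```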
3.2. Model Training
To retrieve images similar to the text query, the model must learn representations (embeddings) during training. In this section, we discuss how to train the text and image encoders using contrastive learning [10]. In this approach, we train the model to distinguish similar <text query, image> pairs from dissimilar ones. In other words, we provide the model with a text query, one similar (positive) image, and a few dissimilar (negative) images. During training, the model learns to produce representations in which the positive image is closer to the text query than the negative ones.
Constructing the dataset
As described earlier, each data point used for training contains a text query, one positive image that is similar to the query, and n−1 negative images that are dissimilar to it. The ground truth label of the data point is the index of the positive image. As Figure 2.0 depicts, along with a text query q, we have n candidate images, of which one is similar to q (a casual men’s beach summer shirt) and the other n−1 are dissimilar. The ground truth label for this data point is the index of the positive image, which is 3 (the third image among the n images in Figure 2.0).
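A small sketch of how one such training example could be assembled; the file names, catalogue, and sampling strategy are hypothetical, and real pipelines would sample negatives more carefully (e.g. hard negatives).

```python
import random

# Build one training example: a text query, n candidate images
# (one positive, n-1 sampled negatives), and the index of the positive.
def build_example(query, positive_image, catalogue_images, n=4):
    negatives = random.sample(
        [img for img in catalogue_images if img != positive_image], n - 1
    )
    candidates = negatives + [positive_image]
    random.shuffle(candidates)
    label = candidates.index(positive_image)  # ground-truth index of the positive image
    return {"query": query, "candidates": candidates, "label": label}

example = build_example(
    "casual men's beach summer shirt",
    positive_image="img_3.jpg",                         # hypothetical file names
    catalogue_images=[f"img_{i}.jpg" for i in range(100)],
)
print(example["label"], example["candidates"])
```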
3.3 Ranking and loss computation
Ranking is a slower — but more precise — step to score and rank top candidates. As we’re processing fewer items (i.e., hundreds instead of millions), we have room to add features that would have been infeasible in the retrieval step (due to computing and latency constraints). Such features include item and user data, and contextual information. We can also use more sophisticated models with more layers and parameters [11].
Loss function selection
The goal of training is to optimize the model parameters so that similar text query-image pairs have embeddings close to each other in the embedding space. Ranking can be modelled as a learning-to-rank or classification task, with the latter being more commonly seen. In this article, we propose using deep learning where the final output layer is a softmax over a catalogue of images. We could also use a sigmoid to predict the likelihood of user interaction (e.g., click, purchase) for each query-image pair. The loss computation steps can be summarised as follows (a minimal sketch follows the list):
- Generate embeddings: Use text and image encoders to generate the embeddings for the search query and catalogue images.
- Compute similarities: Compute the similarities between the text query’s embedding and the catalogue image embeddings. There are different measures of similarity we could use, each with its pros and cons; examples include the dot product, cosine similarity, and the Euclidean distance.
- Softmax function: This is applied over the computed similarity scores so that the values sum to one, allowing us to interpret them as probabilities.
- Cross-entropy: Cross-entropy measures how close the predicted probabilities are to the ground truth labels. When the predicted probabilities are close to the ground truth, the embeddings can distinguish the positive query-image pair from the negative ones.
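Here is a minimal sketch of steps 2 to 4 for a single training example, assuming the encoders have already produced a query embedding and n candidate image embeddings (the dot product is used as the similarity measure).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, image_embs, positive_idx):
    # Step 2: similarity scores, here the dot product between the query and each image.
    logits = image_embs @ query_emb                # shape: (n,)
    # Steps 3-4: softmax over the scores + cross-entropy against the positive index
    # (F.cross_entropy applies the softmax internally).
    target = torch.tensor([positive_idx])
    return F.cross_entropy(logits.unsqueeze(0), target)

query_emb = torch.randn(512)
image_embs = torch.randn(5, 512)   # n = 5 candidates; index 3 is the positive one
loss = contrastive_loss(query_emb, image_embs, positive_idx=3)
print(loss.item())
```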
Figure 2.1 visualises the system architecture: the model takes a text query as input, produces an embedding for the query, and computes the similarity between this embedding and the embedding of each image in the catalogue.
4. Evaluation
Evaluating the performance of the predictions is a critical step, and this section focuses on it. The evaluation metrics can be classified into offline and online metrics. Aminian & Xu (2023) cover this in depth; I summarise it in the subsequent subsections.
4.1. Offline metrics
Offline metrics evaluate the performance of the model and are typically not business related (in the sense that they do not measure KPIs the business cares about, such as click-through rate), but they are still useful for assessing the model’s performance on evaluation data. Offline metrics suitable for this problem are precision@k, mAP, recall@k, and mean reciprocal rank (MRR).
Precision@k
This metric measures the proportion of relevant items among the top k items in the ranked output list. In the evaluation dataset, a given text query is associated with only one positive image since we are using a contrastive learning approach. That means the numerator of the precision@k formula is at most 1, which leads to low precision@k values.
precision@k = Number of relevant images among the top k images / k
Recall@k
This metric measures the ratio between the number of relevant images in the search results and the total number of relevant images.
recall@k = Number of relevant images among the top k images / Total number of relevant images
Since we are using contrastive learning (one similar query-image pair and other dissimilar images), the denominator (the total number of relevant images) is always 1. Hence, recall@k equals 1 if the relevant image is among the top k images and 0 otherwise. This metric measures a model’s ability to find the associated image for a given text query. However, despite being intuitive and easy to derive, it is not without its faults, which include:
- It depends on the choice of k and choosing the right k could be challenging.
- When the relevant image is not among the k images in the output list, recall@k is always 0. For instance, suppose a model X ranks a relevant image at place 15, and another model Y ranks the same image at place 20. If we use recall@10 to measure the quality of both models, both would have a recall@10 = 0, even though model X is better than model Y.
Mean Reciprocal Rank (MRR)
This metric measures the quality of the model by averaging the reciprocal rank of the first relevant image across the ranked output lists:
MRR = (1/m) × (1/rank_1 + 1/rank_2 + ... + 1/rank_m)
where m is the total number of output lists and rank_i refers to the rank of the first relevant image in the i-th ranked output list.
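Minimal implementations of the three offline metrics discussed above, assuming (as in the contrastive setup) that each query has exactly one relevant image; the image IDs are hypothetical.

```python
def precision_at_k(ranked_images, relevant_image, k):
    # Fraction of the top-k results that are relevant.
    return sum(img == relevant_image for img in ranked_images[:k]) / k

def recall_at_k(ranked_images, relevant_image, k):
    # With a single relevant image, recall@k is 1 if it appears in the top k, else 0.
    return 1.0 if relevant_image in ranked_images[:k] else 0.0

def mean_reciprocal_rank(ranked_lists, relevant_images):
    # Average of 1/rank of the first relevant image over all output lists.
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_lists, relevant_images):
        if relevant in ranked:
            reciprocal_ranks.append(1.0 / (ranked.index(relevant) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

ranked = ["img_7", "img_3", "img_9", "img_1"]
print(precision_at_k(ranked, "img_3", k=2), recall_at_k(ranked, "img_3", k=2))
print(mean_reciprocal_rank([ranked], ["img_3"]))  # 0.5
```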
4.2. Online metrics
In this section, we explore some commonly used online metrics for measuring how quickly users can discover product images they like. Online metrics are tied to user activity and can be translated into business outcomes; they include click-through rate (CTR), the average duration spent on suggested images, the average revenue from suggested images, etc.
Click-through rate (CTR)
This metric shows how often users click on the displayed images. CTR can be calculated using the formula:
CTR = Number of clicked images / Total number of suggested images
A high CTR indicates that users click on the suggested images often, however, it does not indicate or track whether the clicked images are relevant or not.
Average duration spent on the suggested images
This metric captures how engaged users are with the recommended images. The duration window varies with the business objective; it could be daily, weekly, or monthly, measured in minutes or even hours. When the search system is accurate, we expect this metric to increase.
5. Serving
At serving time, the system displays a ranked list of product images relevant to a given text query. Figure 2.2 shows the prediction pipeline and the text and image indexing pipelines. Each pipeline is discussed in detail in the original post.
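A rough sketch of the online prediction path, assuming the catalogue image embeddings were computed and indexed offline. Faiss is one possible nearest-neighbour library; the embedding dimension, index choice, and the stand-in text encoder are assumptions for illustration.

```python
import numpy as np
import faiss

d = 512                                                           # embedding dimension
image_embeddings = np.random.rand(100_000, d).astype("float32")   # built by the indexing pipeline
index = faiss.IndexFlatIP(d)                                      # inner-product (dot-product) index
index.add(image_embeddings)

def search(text_query, text_encoder, k=10):
    # Encode the query, then retrieve the k most similar image embeddings.
    query_emb = text_encoder(text_query).astype("float32").reshape(1, -1)
    scores, image_ids = index.search(query_emb, k)
    return list(zip(image_ids[0], scores[0]))

# Example with a stand-in encoder that returns a random vector:
results = search("casual shirt for summer", lambda q: np.random.rand(d))
print(results)
```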
Conclusion
It’s vital to keep in mind that while we have simplified the system design for product search on e-commerce platforms, the real thing is considerably more complex. Other elements, such as incorporating popularity and freshness, richer product features and user interactions, and expanding text queries to support other languages, were not included in this design.
[11] provides an overview of the offline-online, retrieval-ranking pattern for search and recommendations used by major companies such as Alibaba, Facebook, LinkedIn, etc. They also distinguish the latency-constrained online systems from the less-demanding offline systems and split the online process into retrieval and ranking steps.
For brevity’s sake, a lot of detail has been omitted from this post; it is available in the original article posted at https://www.babaniyi.com/2023/03/22/designing-a-recommendation-system-for-search-in-ecommerce.html
References
[1] Ali Aminian & Alex Xu (2023). Machine Learning System Design Interview.
[2] Full Text Search with MongoDB
[3] How To Improve Database Searches with Full-Text Search
[4] Elasticsearch
[5] Preprocessing text data
[6] An Overview of Text Representation in NLP
[7] NLP text encoding
[8] Word2Vec
[9] ResNet
[10] A Simple Framework for Contrastive Learning of Visual Representations
[11] Yan, Ziyou. (Jun 2021). System Design for Recommendations and Search