Assortment Recommendation Using Reinforcement Learning for CPG companies

November 8, 2019


We consider an assortment recommendation problem in the CPG beverages industry. Consider a scenario in which a sales representative (rep) for a retail store is provided with an app that recommends, on a monthly basis, an assortment of beverages to push to the store. The goal is to provide recommendations aligned with the sales rep's interests, which are unknown a priori. We first model this problem using the well-known Multi-Armed Bandit (MAB) framework, a flavor of the celebrated Reinforcement Learning paradigm. We then suggest an algorithm that incorporates feedback received from the sales rep in order to make recommendations aligned with the rep's interests; essentially, the algorithm learns those interests from the feedback. Finally, we test the suggested algorithm's performance on synthetic data and show that it indeed makes recommendations aligned with the sales rep's interests.


Problem statement

Develop a reinforcement-learning-based solution using a K-armed bandit methodology for the following recommendation problem: ABC_Beverages today has an app in which sales reps are presented with an assortment recommendation of SKUs to push to a store. The sales rep may or may not select all 'n' recommendations the app suggests. Assume that each time the app proposes 'n' SKUs to a sales rep for a particular store, the rep accepts only 'r' of them. Develop a solution that learns the choices made by the sales rep and incorporates that learning into future recommendations so as to narrow the gap 'n - r'.

Reinforcement learning: A brief introduction

Key features of RL

•  A reinforcement learning algorithm, or agent, learns by interacting with its environment.

•  The agent receives rewards by performing correctly and penalties for performing incorrectly.

•  The agent learns without intervention from a human by maximizing its reward and minimizing its penalty.


Fig. 1 RL Schematic


Mathematical description of the problem


•  The first assumption is that we build the model on synthetic data as a first development step.

•  The recommendations are made per store; the algorithm developed for one store can thus be applied to other stores as well.

•  The recommendations are made per store on a monthly basis; with this assumption, and assuming three years of actual data, we have only 36 data points (recommendations) per store for which we have feedback (the sales rep's choices over our recommendations).

•  We assume a static scenario for the first model development (the environment does not change).

•  Say we have 'ni' SKUs as part of a monthly recommendation; we assume that each choice follows an unknown probability distribution and is independent of other recommendations.


Reinforcement learning model

•  The above assumptions lead us to frame the problem as an MDP (Markov Decision Process), a broad framework for posing a reinforcement learning problem.

•  The actual technique we apply comes from the family of stochastic bandit problems.

•  For the problem at hand we propose a solution methodology using the UCB1 algorithm, described below.


The Model in layman’s terms:

Fig. 2  RL model flow schematic : High  Level


Reinforcement learning based proposed solution: UCB1 algorithm

We now present an algorithm named UCB1 (Upper Confidence Bound) for the aforementioned recommendation problem. The algorithm was proposed by Auer et al. in [1]. It has two phases: Initialization and Loop. We need the following notation in order to introduce the algorithm. Note that the algorithm below is given for a particular store i; the same can be applied at other stores by substituting the data for those stores. We treat each recommendation as one round.


•  Let Ni be the total number of beverage brands available to recommend for store-i, i.e., the size of the universe

•  Let ni be the number of beverage brands recommended to store-i at once

•  Let rj,k denote the observation received for beverage brand j when it is recommended for the kth time

•  Let Tj(t) denote the number of times beverage brand j has been recommended up to round t


Initialization phase:

Recommend each beverage brand j exactly once and receive the corresponding observation rj,1.


Loop phase:

In round t, calculate the UCB1 index for each brand j (as in [1]): UCBj(t) = x̄j(t) + √(2 ln t / Tj(t)), where x̄j(t) is the average of the observations received for brand j up to round t.

Recommendation in round-t:

Sort the above UCBj(t) values in descending order and choose the respective top ni beverage brands as the recommendation. Note that if some data regarding this recommendation problem are already available, we can replace the Initialization phase above with calculations on the available data.
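The two phases above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the class layout, the 0/1 reward encoding (1 = sales rep selects, 0 = does not select), and the feedback callback are illustrative assumptions.

```python
import math

class UCB1Recommender:
    """A sketch of UCB1 for recommending n_i of N_i beverage brands per round."""

    def __init__(self, num_brands, num_recommend):
        self.N = num_brands                 # N_i: size of the brand universe
        self.n = num_recommend              # n_i: brands recommended per round
        self.counts = [0] * num_brands      # T_j(t): times brand j recommended
        self.rewards = [0.0] * num_brands   # sum of observations r_{j,k}
        self.t = 0                          # total observations so far

    def initialize(self, feedback):
        """Initialization phase: recommend every brand once and record
        the observation r_{j,1} returned by the feedback callback."""
        for j in range(self.N):
            self.update([j], [feedback(j)])

    def ucb_index(self, j):
        # UCB1 index: empirical mean plus exploration bonus
        mean = self.rewards[j] / self.counts[j]
        return mean + math.sqrt(2 * math.log(self.t) / self.counts[j])

    def recommend(self):
        """Loop phase: sort indices in descending order, keep the top n_i."""
        ranked = sorted(range(self.N), key=self.ucb_index, reverse=True)
        return ranked[:self.n]

    def update(self, recommended, observations):
        """Record the sales rep's 0/1 feedback for each recommended brand."""
        for j, r in zip(recommended, observations):
            self.counts[j] += 1
            self.rewards[j] += r
            self.t += 1
```

In use, `initialize` is called once with the first round of feedback, and then `recommend`/`update` alternate every month; over rounds, brands the rep keeps selecting accumulate higher empirical means and dominate the top-ni slots.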


Synthetic data generation

We generate two sets of synthetic data to test UCB1's performance; the details are given below.


Details of Dataset-1 generation:

•  We divide the universe of beverage brands (SKUs) for a store into two sets: R1 (brands preferred by the sales rep) and R2 (brands not preferred). We chose sizes of 12 and 18 for R1 and R2 respectively, and limited the number of beverage brands recommended in any round to 10, i.e., ni = 10.

•  If the recommendation algorithm recommends a beverage brand in R1, we give it a score of 1 (the sales rep selects it); if it is in R2, we give it a score of 0 (the sales rep does not select it).
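A minimal sketch of the Dataset-1 generation is shown below. The set sizes (12 preferred, 18 not preferred, ni = 10) come from the text; the use of integer SKU ids and the random seed are illustrative assumptions.

```python
import random

random.seed(42)
universe = list(range(30))      # 12 + 18 = 30 SKUs in the store's universe
R1 = set(universe[:12])         # preferred brands; the remaining 18 form R2

def feedback_dataset1(sku):
    """Deterministic scoring: 1 if the SKU is in R1 (rep selects), else 0."""
    return 1 if sku in R1 else 0

# One synthetic round: a recommendation of n_i = 10 SKUs and its scores
recommendation = random.sample(universe, 10)
scores = [feedback_dataset1(sku) for sku in recommendation]
```

Because the scoring is deterministic, every preferred brand always scores 1 and every non-preferred brand always scores 0, which is exactly the all-or-nothing behavior noted for Dataset-1.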


Details of Dataset-2 generation:

•  We divide the universe of beverage brands (SKUs) for a store into two sets: R1 (brands preferred by the sales rep) and R2 (brands not preferred). We chose sizes of 8 and 22 for R1 and R2 respectively, and limited the number of beverage brands recommended in any round to 10, i.e., ni = 10.

•  For the preferred set R1, we draw from a Bernoulli distribution with probability 0.8; this ensures that a preferred brand has a high (but not certain) chance of being selected.

•  For the non-preferred set R2, we draw from a Bernoulli distribution with probability 0.6; this gives the non-preferred SKUs a chance of still being selected and makes the simulation feel more realistic.
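The stochastic Dataset-2 feedback can be sketched as follows. The probabilities (0.8 and 0.6) and set sizes (8 and 22) come from the text; the integer SKU universe and the seed are illustrative assumptions.

```python
import random

random.seed(7)
universe = list(range(30))
R1 = set(universe[:8])          # 8 preferred brands; the remaining 22 form R2

def feedback_dataset2(sku):
    """Bernoulli feedback: p = 0.8 for preferred SKUs, p = 0.6 otherwise."""
    p = 0.8 if sku in R1 else 0.6
    return 1 if random.random() < p else 0
```

Unlike Dataset-1, both sets can now produce either score, so the learner must separate a 0.8-mean arm from a 0.6-mean arm rather than a 1 from a 0, which is a noticeably harder bandit problem.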



In this section we detail the implementation plan: we apply a basic RL algorithm, UCB1, to the synthetically generated data and simulate incorporating the sales representative's feedback on a base recommendation.


Snapshot of synthetic data

As depicted in Fig. 3, the tabular data has essentially two parts. In the left-hand table, each row represents a set of 10 SKU recommendations, from forecasting or other econometric models, for the ith generation of recommendations. The table on the right is a heat map: green denotes that the SKU falls among the sales rep's preferred SKUs for that store, and red denotes otherwise. Thus a large red area shows a greater mismatch between the current recommendations from the forecasting algorithms and the sales rep's choices.



Fig. 3 RL Implementation  synthetic data tables


Step 1: Tabulation


Based on the raw data, we generate a table like the one below; the columns are explained as follows.


Fig. 4 Structure of tabulation for RL implementation


Column explanations

•  Brand_cd: The table has one row for each unique brand in the universe for the current store under consideration.

•  Tbi (total suggested occurrences): How many times this brand has been suggested in the raw data.

•  Avg choice occurrence: The ratio (total selections of a brand by the sales rep) / (total suggested occurrences). In the first model this number will always be 0 or 1: if the brand is among the preferred brands we always select it, otherwise we never do. This is improved in the next data-generation plan.
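The tabulation in Step 1 can be sketched as below. The raw log is assumed to be a list of (brand_cd, selected) pairs, with selected being 0 or 1; that log format, and the brand codes, are illustrative assumptions.

```python
# Step 1 sketch: tabulate per-brand counts from a raw recommendation log.
history = [("B1", 1), ("B2", 0), ("B1", 1), ("B3", 1), ("B2", 0), ("B1", 1)]

table = {}
for brand, selected in history:
    row = table.setdefault(brand, {"Tbi": 0, "selections": 0})
    row["Tbi"] += 1                # total suggested occurrences
    row["selections"] += selected  # times the sales rep accepted the brand

for row in table.values():
    # Avg choice occurrence = selections / suggested occurrences
    row["avg_choice"] = row["selections"] / row["Tbi"]
```

With the deterministic Dataset-1 scoring, avg_choice comes out as exactly 0 or 1 per brand, as noted above; with Dataset-2 it becomes a fractional selection rate.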


Step 2: Index calculations

For all the brands tabulated in the previous step, compute the UCB1 index for each brand j: UCBj(t) = (Avg choice occurrence)j + √(2 ln t / Tbi,j).


Here 't' is the generation or row number. Say we did not apply RL for the first 60 generations; the 61st generation is then where RL begins, so t = 61.


Step 3: Sort Indices

Sort the indices from the previous step in descending order and recommend the top ten brands for the next generation, i.e., the 61st row or generation.
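Steps 2 and 3 together can be sketched as follows. The per-brand counts and averages here are illustrative placeholders, not real data; only the index formula and the top-ten sort reflect the procedure described above.

```python
import math

t = 61                              # first RL generation, per the text
# Placeholder tabulation: {brand: {"Tbi": ..., "avg_choice": ...}}
stats = {f"brand_{j}": {"Tbi": 5 + j % 7, "avg_choice": (j % 10) / 10}
         for j in range(30)}

def ucb_index(row, t):
    # Empirical selection rate plus UCB1 exploration bonus
    return row["avg_choice"] + math.sqrt(2 * math.log(t) / row["Tbi"])

# Step 3: sort indices in descending order and keep the top ten brands
ranked = sorted(stats, key=lambda b: ucb_index(stats[b], t), reverse=True)
next_generation = ranked[:10]       # recommendation for generation t
```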

Step 4: Proceeding ahead

Repeat steps 1 to 3 for as many generations ahead you want to recommend

Step 5: End and check

Check the sparsity index of the RL-generated rows vs. the original rows; the sparsity should decrease.
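One way to compute that check, taking the sparsity index to mean the fraction of recommended SKUs the sales rep did not select (the red cells in the heat map): this reading and the sample rows below are illustrative assumptions.

```python
def sparsity(generations):
    """generations: list of per-generation 0/1 score lists.
    Returns the fraction of recommendations that were not selected."""
    total = sum(len(g) for g in generations)
    misses = sum(g.count(0) for g in generations)
    return misses / total

before_rl = [[1, 0, 0, 1, 0], [0, 0, 1, 0, 1]]   # illustrative pre-RL rows
after_rl = [[1, 1, 1, 0, 1], [1, 1, 1, 1, 1]]    # illustrative post-RL rows
```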





Fig. 5 Data table and sparsity map before and after RL implementation for dataset 1


Fig. 6 Data table and sparsity map before and after RL implementation for dataset 2


We observe from Fig. 5 that the generations post-RL conform absolutely to the sales rep's preferences. However, for dataset 2, where to mimic a real-world scenario we introduce selection probabilities for both the preferred and non-preferred SKUs, we observe a less sparse heat map, as seen in Fig. 6.


In this study we presented a methodology to incorporate sales rep feedback into any existing forecast or suggestion for beverage sales, using the well-known MAB technique from Reinforcement Learning. We showed how sales recommendations can be modified or improved by combining real-world ground feedback with forecasts from mathematical models, yielding a choice set that may lead to higher sales at stores. The concept presented here may be further extended by econometrics practitioners: the complex econometric models implemented at CPG companies may be augmented with reinforcement learning, leading to shelf-stock recommendations that may result in higher sales.



The author wishes to acknowledge the inputs and suggestions of Dr. Ravi Kolla, an alumnus of IIT Madras, an industry data science practitioner, and a subject matter expert in the field of reinforcement learning.



[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47.2-3 (2002): 235-256.



Dr. Anish Roy Chowdhury is currently an industry data science leader at a digital services company and one of the mentors for Karmaa Lab. In previous roles he was a data science research lead at ABInBev, working in areas including assortment optimization and reinforcement learning, and he has led several machine learning projects in credit risk, logistics, and sales forecasting. In his stint with HP Supply Chain Analytics he developed data quality solutions for logistics projects and built statistical models to predict spare-part demand for large-format printers. Prior to HP he had six years of experience in the IT sector as a database programmer, during which he worked on credit card fraud detection among other analytics-related projects. He holds a PhD in Mechanical Engineering (IISc Bangalore) and an MS in Mechanical Engineering from Louisiana State University, USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-fuzzy-logic applications to medical diagnostics.

Dr. Anish is also a highly acclaimed public speaker, with numerous best-presentation awards from national and international conferences, and has conducted several workshops at academic institutes on R programming and MATLAB. He has several academic publications to his credit and is a chapter co-author for a Springer publication. He has extensively contributed to the revision of a bestselling MATLAB book from Oxford University Press, as the sole contributor to the chapters on data analysis and statistics.
