- Open Access
- Authors : Sami M K , Radhika Gupta , Kratika Gupta
- Paper ID : IJERTV10IS060166
- Volume & Issue : Volume 10, Issue 06 (June 2021)
- Published (First Online): 19-06-2021
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Predicting Customers’ Next Order
1st Sami M K
Department of Computer Engineering, Indira College of Engineering and Management
Pune, India
2nd Radhika Gupta
Department of Computer Engineering, Indira College of Engineering and Management
Pune, India
3rd Kratika Gupta
Software Development Engineer, Amazon Web Services
Seattle, USA
Abstract: The popularity of targeted marketing has grown over the past few years, and online shopping has become the new normal during the Covid-19 pandemic. Customers buying products leave behind a trail that helps us predict their future purchases. Understanding customers' demand and their shopping patterns is the key to targeted marketing and is of immense value to companies. Using machine learning, we can recognize predictive patterns in customers' behavioural data, which can be used to automatically add products to the shopping cart. The user can then review the products in the cart before placing the order.
INTRODUCTION
The fact is, technology collects data with every single click. With this information, it becomes much easier for companies to improve their marketing strategies. Predicting customers' demand gives a company the information to strategize and act accordingly. It also helps customers by automatically adding products to their cart. Such a model has a competitive advantage over traditional methods. We introduce a model that uses a combination of a person's previous orders and the time interval between consecutive orders to predict their next order.
PROBLEM STATEMENT
To create a Machine Learning model that will help the user to determine their next order based on their previous ordering history. The model should determine the product, the interval after which the product will be ordered and the quantity of the products to be ordered.
DATASET
The dataset was released by Instacart under the name "The Instacart Online Grocery Shopping Dataset 2017". It is a set of files containing customers' order histories. The dataset holds 3 million anonymized orders from nearly 200,000 Instacart users, and for each user it provides between 4 and 100 of their orders.
The dataset is divided into three parts: prior, train and test. The test data does not include information about reordered products, and the number of orders per product is not the same.
The dataset consists of five tables, products, aisles, departments, orders and order_products, and has a relational structure. Moreover, it supplies the day of the week and hour of the day at which each order was placed, and a relative measure of the time between orders. To work with this data, we used MySQL.
Fig.1. Relation diagram of dataset
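As an illustration of this step, the sketch below loads the released CSV files into MySQL with pandas and SQLAlchemy. The file names follow the public Instacart release, and the connection string, database name and credentials are placeholders, not the authors' actual setup.

```python
# Sketch: load the Instacart CSV files into MySQL tables.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- replace user, password and database name.
engine = create_engine("mysql+pymysql://user:password@localhost/instacart")

csv_files = {
    "products": "data/products.csv",
    "aisles": "data/aisles.csv",
    "departments": "data/departments.csv",
    "orders": "data/orders.csv",
    "order_products_prior": "data/order_products__prior.csv",
}

for table_name, path in csv_files.items():
    df = pd.read_csv(path)
    # One table per file; chunksize keeps memory use bounded on the large files.
    df.to_sql(table_name, engine, if_exists="replace", index=False, chunksize=10000)
    print(f"Loaded {len(df):,} rows into {table_name}")
```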
DATA PREPARATION
We generated five tables from the dataset: Products, Aisles, Departments, Orders and Order Products. We then joined these tables into Productscombined (Departments, Aisles and Products) and Ordercombined (order_products_prior and orders). The Productscombined table contains all the details of each product and holds 73,575 product ids; the Ordercombined table has 73,000 records and contains all the details of each order.
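A minimal sketch of these joins using pandas is shown below. The column names (aisle_id, department_id, order_id, product_id) follow the Instacart schema, and the variable names simply mirror the tables described above.

```python
import pandas as pd

# Raw tables (see the loading sketch in the previous section).
products = pd.read_csv("data/products.csv")
aisles = pd.read_csv("data/aisles.csv")
departments = pd.read_csv("data/departments.csv")
orders = pd.read_csv("data/orders.csv")
order_products_prior = pd.read_csv("data/order_products__prior.csv")

# Productscombined: each product with its aisle and department details.
productscombined = (
    products
    .merge(aisles, on="aisle_id", how="left")
    .merge(departments, on="department_id", how="left")
)

# Ordercombined: prior order lines joined with the order metadata.
ordercombined = order_products_prior.merge(orders, on="order_id", how="left")

print(productscombined["product_id"].nunique(), "distinct products")
print(len(ordercombined), "order-product records")
```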
PROCESS
Fig.2. Systematic process
We predicted the products that will be reordered based on the number of days since the last order, the day of the week, the time of day and the products the customer adds first to the cart, out of 10,931 products. We then merged the data from the combined tables for exploratory analysis. First, we merged the productscombined table (which has information related to products) with the ordercombined1 table (which has details about prior orders) and named the result prioralldata. Second, we merged the productscombined table with the ordercombined2 table (which has details about the training orders) and named the result trainalldata. We also computed the top 10 most popular products overall by product_name, within each department and within each aisle.
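The top-10 computation could look like the following sketch. It assumes the merged tables from the previous sketches; the exact grouping the authors used may differ.

```python
# Sketch: top-10 most popular products overall, per department and per aisle.
# prioralldata is the merge of ordercombined (prior orders) and productscombined.
prioralldata = ordercombined.merge(productscombined, on="product_id", how="left")

top10_overall = prioralldata["product_name"].value_counts().head(10)

top10_per_department = (
    prioralldata.groupby(["department", "product_name"]).size()
    .groupby(level="department", group_keys=False)
    .nlargest(10)
)

top10_per_aisle = (
    prioralldata.groupby(["aisle", "product_name"]).size()
    .groupby(level="aisle", group_keys=False)
    .nlargest(10)
)

print(top10_overall)
```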
We calculated the distribution of reorders along several dimensions: reorders on each day of the week, reorders at each hour of the day, the frequency distribution by days since the prior order, the distribution of orders versus reorders in the orders_prior and orders_train tables, and the distribution of the top-10 products and top-10 aisles in the orders_prior table.
After observing the distribution of orders on different days of the week per hour, we found that most of the purchases were made between 9 am and 7 pm, i.e., during office hours; on the weekends, however, the scenario was slightly different. On Saturdays, the number of orders increased steadily from 9 am and dropped sharply after 4 pm. On Sundays, on the other hand, orders peaked at 10 am and dropped every hour until 5 pm. We then calculated which products were popular purchases on weekends (with respect to the orders_prior table), which showed that people mostly bought organic fruits and vegetables on the weekends.
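A sketch of how these distributions can be computed is shown below. It assumes the orders table exposes the order_dow and order_hour_of_day columns of the Instacart schema and that prioralldata carries the reordered flag.

```python
# Sketch: number of orders for each (day of week, hour of day) combination.
orders_by_dow_hour = (
    orders.groupby(["order_dow", "order_hour_of_day"])
    .size()
    .unstack(fill_value=0)   # rows: day of week, columns: hour of day
)

# Reorder rate for each day of the week, from the prior order lines.
reorder_rate_by_dow = prioralldata.groupby("order_dow")["reordered"].mean()

print(orders_by_dow_hour)
print(reorder_rate_by_dow)
```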
Fig.3. productcombined
Fig.4. ordercombined1
Fig.5. ordercombined2
FEATURE SELECTION
LASSO Regression
Lasso regression is linear regression with L1 regularization. A point that deviates strongly from the overall trend of the data is called an outlier; outliers can arise from human or experimental error or from variability during data collection. Because of outliers, an ordinary fit may not follow an almost straight line: the predicted values end up far from the actual values, not because of the gradient descent or the cost function, but because of the data.
LASSO involves a penalty factor that decides how many features are kept; using cross-validation to choose the penalty factor helps ensure that the model will generalize well to future data samples. LASSO thereby automates the feature selection that, in standard linear regression, is done by stepwise selection or by choosing the features with the lowest p-values.
We used the LASSO regression algorithm to choose the six features that best help decide which products will be reordered.
Fig.6. Top 6 features selected using LASSO
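A minimal sketch of LASSO-based feature ranking is shown below. It assumes X is a pandas DataFrame of candidate numeric features and y is the binary reordered target, and it uses cross-validated selection of the penalty (LassoCV) rather than the authors' exact settings, which are not given.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# X: DataFrame of candidate numeric features, y: binary `reordered` target.
X_scaled = StandardScaler().fit_transform(X)

# LassoCV picks the L1 penalty strength (alpha) by cross-validation.
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y)

# Rank features by the absolute value of their coefficients; features whose
# coefficient is shrunk to exactly zero are effectively discarded.
coef = pd.Series(np.abs(lasso.coef_), index=X.columns).sort_values(ascending=False)
print(coef.head(6))   # the six most influential features
```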
SelectKBest Algorithm
SelectKBest is a univariate feature-selection method: each feature is scored against the target with a statistical test (for example, the ANOVA F-value) and only the k highest-scoring features are kept.
We used the SelectKBest algorithm, in addition to LASSO, to choose the six features that will help in deciding which products will be reordered.
Fig.7. Top 6 features chosen
We added a new feature based on reorders in relation to the total number of products and found that around 60% of all the products had been reordered.
We performed feature selection with both SelectKBest and LASSO. The two algorithms gave almost identical results, so we chose the first six features, order_number, add_to_cart_order, days_since_prior_order, order_hour_of_day, product_id and order_id, to predict which products will be reordered.
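For comparison, a sketch of the SelectKBest step is shown below, under the same assumption that X holds the candidate features and y the reordered target.

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the k features with the highest ANOVA F-scores against the target.
selector = SelectKBest(score_func=f_classif, k=6)
selector.fit(X, y)

selected_features = X.columns[selector.get_support()]
print("Top 6 features:", list(selected_features))
```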
DATA CLEANING
To clean the data, we replaced all NaN and infinite values with the mean of the respective column. We dropped categorical features, since only numeric data can be fed into the machine learning algorithms we used.
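A minimal sketch of this cleaning step, assuming df is the merged feature table:

```python
import numpy as np

# df is the merged feature table used for modelling.
df = df.replace([np.inf, -np.inf], np.nan)          # treat infinities as missing
numeric_df = df.select_dtypes(include=[np.number])  # keep only numeric columns
numeric_df = numeric_df.fillna(numeric_df.mean())   # impute NaNs with column means
```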
SELECTION OF MODEL
Fig.8. Analysis of each model
Based on the above comparison, we selected the Random Forest algorithm, since it provided the highest accuracy.
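The comparison could be reproduced along the lines of the sketch below. The list of candidate models is illustrative (the authors' full comparison is shown in Fig. 8), and X_selected is assumed to hold the six selected features with y the reordered target.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Candidate models (illustrative list); X_selected holds the six chosen features.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_selected, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```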
Random Forest Model
Random Forest is a tree-based machine learning algorithm. Multiple decision trees are constructed, each trained on a sample drawn from the original dataset. In a regression task, the result is the average of the individual tree predictions; in a classification task, it is the majority class vote. The higher the number of trees in the forest, the higher the accuracy. The Random Forest algorithm is:
1. Select k random points from the training set.
2. Build a decision tree with the selected data points.
3. Choose the number of decision trees you want to build.
4. Repeat steps 1 and 2 for each tree.
5. For each data point, obtain the prediction of every tree and make the final prediction based on the majority vote.
We needed to decide the number of trees. Although a greater number of trees improves the quality of classification, it makes the code run slower. We checked the accuracy, precision and recall for 120, 300, 500, 800 and 1200 trees. Based on the output, we built the Random Forest classifier with n_estimators = 1200, i.e., we used 1200 decision trees to build the model.
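A sketch of this tree-count experiment is shown below; X_train, X_test, y_train and y_test are assumed train/test splits of the selected features and the reordered target.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Try the tree counts mentioned in the text and compare the metrics.
for n_trees in [120, 300, 500, 800, 1200]:
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=42, n_jobs=-1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(
        f"n_estimators={n_trees}: "
        f"accuracy={accuracy_score(y_test, pred):.3f}, "
        f"precision={precision_score(y_test, pred):.3f}, "
        f"recall={recall_score(y_test, pred):.3f}"
    )
```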
To increase the accuracy, we tuned a few parameters: max_depth, min_samples_split, max_leaf_nodes and max_features.
- max_depth is the maximum depth of each tree; when it is None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples. We tested max_depth values of 5, 8, 25, 30 and None and selected max_depth = 25, as it gave the best result.
- min_samples_split is the minimum number of samples needed to split an internal node; its default value is 2. We tested the values 2, 5, 10, 15 and 100; min_samples_split = 2 gave the best result.
- max_leaf_nodes is the maximum number of leaf nodes in a tree. We tested the values 2, 5, 10 and None; None gave the best result.
- max_features is the number of features to consider when looking for the best split; the search for a split stops once at least one valid partition of the node samples is found. We tested max_features last.
With all the parameters for the Random Forest finalized, we compared the results against data without 'add_to_cart_order' and 'product_id', because we did not have this information in our test data set. A sketch of this parameter tuning follows Fig. 9.
Fig.9. Analysis of the model
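The tuning described above could be carried out with a grid search along the following lines. The value grids mirror the candidates tested in the text, while the max_features grid is illustrative, since the exact candidates are not stated.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Value grids mirror the candidates tested in the text; the max_features
# grid is illustrative, since the exact candidates are not stated.
param_grid = {
    "max_depth": [5, 8, 25, 30, None],
    "min_samples_split": [2, 5, 10, 15, 100],
    "max_leaf_nodes": [2, 5, 10, None],
    "max_features": ["sqrt", "log2", None],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=1200, random_state=42, n_jobs=-1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```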
FUTURE SCOPE
This machine learning model could be used by marketing strategists to increase the market value of supermarkets and online grocery stores. The algorithm could also be extended to other datasets; for example, it could be trained on a pharmaceutical store's dataset to automatically order medicines for regular customers, such as patients suffering from diabetes or low blood pressure. The accuracy of the model could be increased further by deploying other models in place of the Random Forest or LASSO regression models.
CONCLUSION
Using machine learning algorithms such as LASSO, SelectKBest and the Random Forest classifier, we predicted the date, time and products of the customer's next order. After testing the model, we obtained an accuracy of 89%. There are many directions in which this model can be extended. Moreover, using these predictions, supply chain industries can enhance their marketing strategies. The system also provides a platform on which users have to do minimal work.