Building a Product Recommendation System for E-Commerce: Part I — Web Scraping
Today, if we think of the most successful and widespread applications of machine learning in business, recommender systems are likely one of the first examples that come to mind. Each time you purchase something online, you might see a “products you might also like” section. Recommender systems help users discover items they might like but have not yet found, which can help companies maximize revenue from upselling and cross-selling. As a Data Science Intern at ScoreData, I wanted to take the opportunity to build a recommendation model and analyze data on ScoreData’s ML platform (ScoreFast™). Since we don’t have customers’ purchase history from any E-commerce website, I decided to build a content-based recommendation system using product descriptions and reviews. The underlying idea is that if a user is interested in a product, we can recommend several products that are similar to it.
The goals of this project were to:
- Gather product information and review data from Backcountry.com through web scraping using Selenium and BeautifulSoup
- Perform an exploratory data analysis using the ScoreFast™ platform
- Convert text data into vectors
- Build a KNN predictive model to find the most similar products
- Use sentiment analysis on product reviews
- Use each review’s sentiment score to predict its rating
- Generate word clouds to find the customers’ pain points
In this blog, I will only cover the data collection part. If you are interested in learning more about the model-building process, please check out my next blog.
Data Collection
I started the project by scraping the relevant data from an E-commerce site. In this post, I’ll share how to extract information from a website using Beautiful Soup, Selenium, and Pandas.
At the beginning of this project, I wanted to experiment with a smaller dataset first. Therefore, I only gathered data from the Women’s Dresses & Skirts category on Backcountry.com. I first went to the Women’s Dresses & Skirts page and extracted all the product URLs, which were stored in anchor elements (<a>). For each product URL, I gathered product information such as the product description, product details, tech specs, rating, number of reviews, and review contents.
To extract these data from the site, we first need to inspect the web page and find where the information we want is stored. Sometimes, the information won’t show up unless we scroll down the page; in that case, we need to use a web driver to scroll down the page for us.
# Imports needed for the scraping code below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

# Get all the product links from the category pages
link_list = []

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)

for url in url_list:
    driver.get(url)

    # Scroll down the page step by step so that all the product tiles get rendered
    scroll_to = 0
    for i in range(5):
        scroll_to += 500
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    # Parse the rendered HTML only after scrolling
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')

    # Each product tile stores its link in an anchor element
    atag = soup.findAll('div', attrs={'class': 'ui-pl-visible-content'})
    for link in atag:
        product_url = link.find('a')['href']
        product_url = 'https://www.backcountry.com' + product_url
        link_list.append(product_url)

driver.close()
Notice that if we add the “--headless” argument, the web driver runs in the background. Otherwise, on each iteration of the loop, a new browser window will pop up and close itself after it successfully extracts the information you specified.
I’ve written a product_information function and a product_review function to gather the data. Here’s the code for the product_information function.
def product_information(url):
    product = {}

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1200")
    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
    driver.get(url)

    # Scroll down to where the review count is located on the page
    scroll_to = 0
    for i in range(5):
        scroll_to += 300
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    # Use the rendered HTML for JavaScript-rendered pages
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')

    # Product name
    product_name = soup.find('h1', {'class': 'product-name qa-product-title'})
    if product_name is None:
        product['product_name'] = None
    else:
        product['product_name'] = product_name.text

    # Price (products on sale use a different class)
    price = soup.find('span', {'class': 'product-pricing__retail'})
    if price is None:
        price = soup.find('span', {'class': 'product-pricing__sale'})
        if price is None:
            product['price'] = None
        else:
            product['price'] = price.text
    else:
        product['price'] = price.text

    # Product description (not all products have a description)
    product_description = soup.find('div', {'class': 'ui-product-details__description'})
    if product_description is None:
        product['product_description'] = None
    else:
        product['product_description'] = product_description.text

    # Product details
    product_details = soup.find('ul', {'class': 'prod-details-accordion__list'})
    product['product_details'] = list(product_details.stripped_strings)

    # Tech specs: each row of the tech spec table holds a name/value pair
    product_first = soup.find_all('div', {'class': 'ui-product-details__techspec-row'})
    if product_first is None:
        product['tech_spec'] = None
    else:
        tech_spec = {}
        for i in product_first:
            tech_name = i.find('dt', {'class': 'ui-product-details__techspec-name'}).text
            tech_value = i.find('dd', {'class': 'ui-product-details__techspec-value'}).text
            tech_spec[tech_name] = tech_value
        product['tech_spec'] = tech_spec

    # Review count
    review_count = soup.find('span', {'class': 'review-count'})
    if review_count is None:
        product['review_count'] = None
    else:
        product['review_count'] = int(review_count.text.split(' ')[0])

    driver.close()
    return product
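The product_review function follows the same pattern: load the product page, scroll until the review section renders, and parse the review blocks. Its code isn’t included in this post, so the sketch below is only an illustration, and the review-related CSS class names in it are assumptions rather than the actual Backcountry markup.

# Hypothetical sketch of a product_review function. The review container and
# rating class names below are assumptions, not taken from the actual site.
def product_review(url):
    reviews = []

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
    driver.get(url)

    # Scroll far enough down the page for the review section to render
    scroll_to = 0
    for i in range(10):
        scroll_to += 500
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')

    # Assumed structure: one container div per review holding review text and a star rating
    for review in soup.find_all('div', {'class': 'review'}):
        text = review.find('p')
        rating = review.find('span', {'class': 'rating'})
        reviews.append({
            'review_text': text.text if text else None,
            'rating': rating.text if rating else None,
        })

    driver.close()
    return reviews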
After I gathered the data, I noticed that a lot of the products don’t have any reviews. Therefore, I decided to scrape more data from the top 9 most popular outdoor brands’ best-selling products.
Create a DataFrame
Once we have stored the relevant data in a nested dictionary, we can unpack it into a pandas DataFrame.
from collections import defaultdict
import pandas as pd

# Unpack the nested dictionary into flat lists, one per column
product_name = []
brand_name = []
price = []
product_description = []
product_details = []
tech_spec = []
review_count = []

for link in link_list:
    product_name.append(product_dict[link]['description']['product_name'])
    brand_name.append(product_dict[link]['brand_name'])
    price.append(product_dict[link]['description']['price'])
    product_description.append(product_dict[link]['description']['product_description'])
    product_details.append(product_dict[link]['description']['product_details'])
    tech_spec.append(product_dict[link]['description']['tech_spec'])
    review_count.append(product_dict[link]['description']['review_count'])

# Collect every unique tech spec name across all products
key_list = set()
for idx, spec in enumerate(tech_spec):
    for key in spec.keys():
        key_list.add(key)
key_list = list(key_list)

# Build one column per tech spec, filling None where a product lacks that spec
key_dictionary = defaultdict(list)
for idx, spec in enumerate(tech_spec):
    for key in key_list:
        if key not in spec.keys():
            key_dictionary[key].append(None)
        else:
            key_dictionary[key].append(spec[key])

tech = pd.DataFrame.from_dict(key_dictionary)
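The code above stops at the tech-spec DataFrame; the step that joins it with the base product columns isn’t shown, but a minimal sketch (assuming one row per product link, in the same order as the lists built above) could look like this:

# Sketch of the final merge: build the base DataFrame from the unpacked lists,
# then join the tech-spec columns side by side. The exact merge used in the
# original project isn't shown above, so this is an assumption.
products = pd.DataFrame({
    'product_name': product_name,
    'brand_name': brand_name,
    'price': price,
    'product_description': product_description,
    'product_details': product_details,
    'review_count': review_count,
})

# Both frames have one row per product link in the same order, so they can be concatenated column-wise
df = pd.concat([products, tech], axis=1)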
Now that we’ve combined all the data into one DataFrame, the next step was to do some exploratory data analysis on ScoreFast™ to better understand the data.
Conclusion
In the next blog, I will explain more about how I built the product recommendation model using this dataset.
Thanks for reading, and we hope everyone is staying safe and healthy. We are all hoping we can get back to normal soon. In the meantime, please check out our other blogs and stay tuned for more!