Building a Product Recommendation System for E-Commerce: Part I — Web Scraping

Sep 13, 2020

Today, if we think of the most successful and widespread applications of machine learning in business, recommender systems would be among the first examples that come to mind. Each time you purchase something online, you might see a “products you might also like” section. Recommender systems help users discover items they might like but have not yet found, which can help companies maximize revenue through upselling and cross-selling. As a Data Science Intern at ScoreData, I wanted to take the opportunity to build a recommendation model and analyze data on ScoreData’s ML platform (ScoreFast™). Since we don’t have customers’ purchase history from any of the E-commerce websites, I decided to build a content-based recommendation system using product descriptions and reviews. The idea underlying such systems is that if a user is interested in a product, we can recommend several products similar to it.

The goals of this project were to:

  • Gather product information and review data from Backcountry.com through web scraping with Selenium and Beautiful Soup
  • Perform an exploratory data analysis using the ScoreFast™ platform
  • Convert the text data into vectors
  • Build a KNN predictive model to find the most similar products (see the preview sketch after this list)
  • Run sentiment analysis on product reviews
  • Use each review’s sentiment score to predict its rating
  • Generate word clouds to find the customers’ pain points
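
As a preview of the vectorization and KNN steps (the next post covers them in detail), here is a minimal sketch using scikit-learn; descriptions is an illustrative stand-in for the list of product description strings, not code from the final model:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# descriptions: a list of product description strings (illustrative)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(descriptions)

# cosine distance suits sparse TF-IDF vectors
knn = NearestNeighbors(n_neighbors=6, metric='cosine')
knn.fit(X)

# the nearest neighbors of the first product are its most similar products
distances, indices = knn.kneighbors(X[0])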

In this blog, I will only cover the data collection part. If you are interested in learning more about the model-building process, please check out my next blog.

Data Collection

I started the project by scraping the relevant data from an E-commerce site. In this post, I’ll share how to extract information from a website using Beautiful Soup, Selenium, and Pandas.

At the beginning of this project, I wanted to experiment with a smaller dataset first, so I only gathered data from the Women’s Dresses & Skirts category on Backcountry.com. I first went to the category page and extracted all the product URLs, which were stored in anchor elements (<a>). For each product URL, I gathered product information such as the product description, product details, tech specs, rating, number of reviews, and review contents.

To extract this data from the site, we first need to inspect the web page and find where the information we want is stored. Sometimes the information won’t show up unless we scroll down the page; in that case, we need a web driver to scroll down for us.

#get all the product links
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

link_list = []
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
for url in url_list:  # url_list holds the category page URLs
    driver.get(url)
    # scroll down in steps so the lazily loaded products render
    scroll_to = 0
    for i in range(5):
        scroll_to += 500
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)
    # parse the fully rendered page, then pull each product link
    innerHTML = driver.execute_script("return document.body.innerHTML")
    soup = BeautifulSoup(innerHTML, 'lxml')
    for tag in soup.findAll('div', attrs={'class': 'ui-pl-visible-content'}):
        product_url = 'https://www.backcountry.com' + tag.find('a')['href']
        link_list.append(product_url)
driver.close()

Notice that adding the “headless” argument lets the web driver run in the background. Otherwise, a new browser window pops up for each iteration of the loop and closes itself after the specified information has been extracted.

I’ve written a product_information function and a product_review function to gather the data. Here’s the code for the product_information function, followed by a sketch of product_review.

def product_information(url):

    product = {}
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--window-size=1920,1200")

    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
    driver.get(url)

    # scroll down to where the review count is located on the page
    scroll_to = 0
    for i in range(5):
        scroll_to += 300
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)

    innerHTML = driver.execute_script("return document.body.innerHTML")  # use this for a JavaScript-rendered page
    soup = BeautifulSoup(innerHTML, 'lxml')

    # product name
    product_name = soup.find('h1', {'class': 'product-name qa-product-title'})
    if product_name is None:
        product['product_name'] = None
    else:
        product['product_name'] = product_name.text

    # price: the retail price, or the sale price if the product is on sale
    price = soup.find('span', {'class': 'product-pricing__retail'})
    if price is None:
        price = soup.find('span', {'class': 'product-pricing__sale'})
        if price is None:
            product['price'] = None
        else:
            product['price'] = price.text
    else:
        product['price'] = price.text

    # product description (not every product has one)
    product_description = soup.find('div', {'class': 'ui-product-details__description'})
    if product_description is None:
        product['product_description'] = None
    else:
        product['product_description'] = product_description.text

    # product details
    product_details = soup.find('ul', {'class': 'prod-details-accordion__list'})
    if product_details is None:
        product['product_details'] = None
    else:
        product['product_details'] = list(product_details.stripped_strings)

    # tech specs: each row of the table holds a name/value pair
    spec_rows = soup.find_all('div', {'class': 'ui-product-details__techspec-row'})
    if not spec_rows:  # find_all returns an empty list, never None
        product['tech_spec'] = None
    else:
        tech_spec = {}
        for row in spec_rows:
            tech_name = row.find('dt', {'class': 'ui-product-details__techspec-name'}).text
            tech_value = row.find('dd', {'class': 'ui-product-details__techspec-value'}).text
            tech_spec[tech_name] = tech_value
        product['tech_spec'] = tech_spec

    # review count, e.g. "12 reviews" -> 12
    review_count = soup.find('span', {'class': 'review-count'})
    if review_count is None:
        product['review_count'] = None
    else:
        product['review_count'] = int(review_count.text.split(' ')[0])

    driver.close()
    return product
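
The product_review function follows the same pattern: load the page, scroll until the review section renders, and parse the reviews out. Here is a minimal sketch reusing the imports from the first snippet; the review class names below are assumptions for illustration, not the site’s actual markup:

def product_review(url):
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome('/usr/local/bin/chromedriver', options=options)
    driver.get(url)
    # scroll far enough for the review section to render
    scroll_to = 0
    for i in range(10):
        scroll_to += 500
        driver.execute_script('window.scrollTo(0, ' + str(scroll_to) + ');')
        time.sleep(1.5)
    soup = BeautifulSoup(driver.execute_script("return document.body.innerHTML"), 'lxml')
    driver.close()
    reviews = []
    for review in soup.find_all('div', {'class': 'review'}):  # hypothetical class name
        text = review.find('p')
        if text is not None:
            reviews.append(text.text)
    return reviews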

After I gathered the data, I noticed that many of the products didn’t have any reviews. Therefore, I decided to scrape more data from the best-selling products of the nine most popular outdoor brands.
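
The DataFrame code in the next section reads from a nested dictionary called product_dict. The code that builds it isn’t shown here, but judging from the keys it accesses, it might be assembled roughly like this (brand_urls and get_product_links are hypothetical helpers standing in for the brand pages and the link-gathering loop above):

product_dict = {}
for brand, brand_url in brand_urls.items():
    for link in get_product_links(brand_url):  # same scroll-and-parse loop as above
        product_dict[link] = {
            'brand_name': brand,
            'description': product_information(link),
        }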

Create a DataFrame

Once we have stored the relevant data in a nested dictionary, we can unpack it into a pandas DataFrame.

from collections import defaultdict
import pandas as pd

product_name = []
brand_name = []
price = []
product_description = []
product_details = []
tech_spec = []
review_count = []

for link in link_list:
    product_name.append(product_dict[link]['description']['product_name'])
    brand_name.append(product_dict[link]['brand_name'])
    price.append(product_dict[link]['description']['price'])
    product_description.append(product_dict[link]['description']['product_description'])
    product_details.append(product_dict[link]['description']['product_details'])
    tech_spec.append(product_dict[link]['description']['tech_spec'])
    review_count.append(product_dict[link]['description']['review_count'])

# collect every unique tech spec name to use as a column key
key_list = set()
for spec in tech_spec:
    if spec:  # products without tech specs are stored as None
        for key in spec.keys():
            key_list.add(key)
key_list = list(key_list)

# build one column per spec name, filling None where a product lacks that spec
key_dictionary = defaultdict(list)
for spec in tech_spec:
    for key in key_list:
        if spec and key in spec:
            key_dictionary[key].append(spec[key])
        else:
            key_dictionary[key].append(None)

tech = pd.DataFrame.from_dict(key_dictionary)
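
To get to the single DataFrame referenced below, the scalar columns can be joined with the unpacked tech-spec columns; a minimal sketch, assuming the lists built above:

# combine the per-product lists with the unpacked tech spec columns
df = pd.DataFrame({
    'product_name': product_name,
    'brand_name': brand_name,
    'price': price,
    'product_description': product_description,
    'product_details': product_details,
    'review_count': review_count,
})
df = pd.concat([df, tech], axis=1)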

Now that we’ve combined all the data into one DataFrame, the next step was to do some exploratory data analysis on ScoreFast™ to better understand the data.

Conclusion

In the next blog, I will explain more about how I built the product recommendation system using this dataset.

Thanks for reading, and we hope everyone is staying safe and healthy. We are all hoping we can get back to normal soon. In the meantime, please check out our other blog posts and stay tuned for more!
