How to turn multiple scraped URLs into a single observation?

I am currently scraping an online store to gather data on all its products. This includes the URLs of the multiple images that each product has. The code so far is as follows:

# collect the src of each thumbnail image for the current product
for image in subsoup.find_all('a', {'class': 'thumb-link'}):
    url.append(image.find('img').get('src'))

This yields a list of URLs for a single product, which is then stored under a dictionary key:

'URL': ['url1', 'url2', 'url3', 'url4']

What I want to know is how to turn the multiple links for each product into a single observation that occupies only one row. That way, when I turn the dictionary into a DataFrame and export it as CSV, all columns will have the same length.

This is the current output:

| Name           | URL    | Price |
| -------------- | ------ | ----- |
| Wesson, aceite | 'url1' | 10.99 |
|                | 'url2' |       |
|                | 'url3' |       |
|                | 'url4' |       |

This is the kind of output I expect:

| Name           | URL                              | Price |
| -------------- | -------------------------------- | ----- |
| Wesson, aceite | ['url1', 'url2', 'url3', 'url4'] | 10.99 |

Answer

I am not sure that is the best format to keep your URLs in. I would be tempted to have one image URL per row and repeat the other product info.
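If you do want that long format, pandas' explode (available from pandas 0.25) turns a list-valued column into one row per element; a minimal sketch, using made-up values in the shape of the output above:

import pandas as pd

# one row per product, with the whole list of image URLs in a single cell
df = pd.DataFrame(
    [['Wesson, aceite', ['url1', 'url2', 'url3', 'url4'], '10.99']],
    columns=['Title', 'Url', 'Price'],
)

# explode() repeats Title and Price for each element of the Url list,
# giving one image URL per row
long_df = df.explode('Url')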

However, for your desired format, you can simply build a list of lists and convert it to a DataFrame at the end. The URL column of a given row is then that product's list of URLs. You can use a helper function that returns the row for a given product page, append each row to a global list, and convert that list to a DataFrame at the end, e.g.

I have used just one category to obtain some test links. If scraping all products, I would first loop over the pages, categories etc., gathering all product page links, and then visit those links to get the info.

As this is likely to involve a lot of links, you will need to add error handling, pauses, retries etc.
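As a minimal sketch of one way to get retries with back-off (the Retry parameters here are illustrative assumptions, not requirements of the site), urllib3's Retry can be mounted on the requests session; the full example then follows:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()

# retry up to 3 times on common transient statuses, waiting longer each attempt
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retry))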

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

def get_row(soup):
    # one row per product: the title, the whole list of image URLs, and the price
    images = [i['src'] for i in soup.select('.product-thumb-item-img')]
    price = soup.select_one('.priceValue').text.strip()
    title = soup.select_one('#product-display-name').text.strip().replace(u'\xa0', ' ')
    return [title, images, price]


results = []
base = 'https://www.pricesmart.com'

with requests.Session() as s:
    # gather a test set of product page links from a single category listing
    test_soup = bs(s.get('https://www.pricesmart.com/site/sv/es/categoria/alimentos?cat=G10D03009').text, 'lxml')
    test_links = [base + i['href'] for i in test_soup.select('.search-product-box a')]
    
    for link in test_links:
        try:
            r = s.get(link)
            soup = bs(r.content, 'lxml')
            results.append(get_row(soup))
        except Exception:
            # for initial debugging; you may want to test the status code etc.
            # and add appropriate error handling
            print(link)
            break
            
df = pd.DataFrame(results, columns=['Title', 'Url', 'Price'])
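Exporting is then a single call. Note that a list-valued cell is written to CSV as its Python repr (a quoted '[...]' string), so reading the file back requires parsing that column; a sketch, with an assumed filename:

import ast

df.to_csv('products.csv', index=False)

# reading back: the Url column arrives as strings, so parse them into lists again
df2 = pd.read_csv('products.csv')
df2['Url'] = df2['Url'].apply(ast.literal_eval)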
