Sunday, 24 Dec 2023 at 06:34

A 15-minute scrape

To protect the innocent I have not included the URLs.
This is something that returns the href links which can then be fed into the main scraper.
Why do I even post such a triviality?
To further the case that a labor-as-a-service platform needs to offer off-the-shelf code priced in 15-minute units or similar, which you can then customize with other 'modules'.

Some modules that might apply in the web scraping context:

  • Authenticator, e.g. JSON token: 15 minutes *
  • WriteToTxt / CSV module: 15 minutes
  • Async: 30-60 minutes *
  • ScrapePage (HTML): 60 minutes *
  • ScrapePage (Selenium): 60 minutes
  • ScrapePage (JSON): 15 minutes
  • Documentation: 15 minutes
  • Logging / timing (see the sketch after this list)
  • WriteToDatabase (SQLite: create tables, upsert to tables, incl. schema): 90 minutes
  • Deployment *
  • Interconnecting scrapers *
  • Proxy management *
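
As an aside, some of these modules really are tiny. A minimal sketch of the Logging / timing module might be nothing more than a decorator that wraps a scraper function and logs how long it took; the names timed and logger below are just illustrative.

# Minimal sketch of a Logging / timing module; names are illustrative.
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('scraper')

def timed(func):
    """Log how long a wrapped scraper function takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logger.info('%s finished in %.2fs', func.__name__, time.perf_counter() - start)
        return result
    return wrapper

And the 15-minute link scrape itself: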

# 15-minute task
import requests
from bs4 import BeautifulSoup

def get_linksDay(url):
    """Return the horse and race href links found on one index page."""
    response = requests.get(url)
    soup_object = BeautifulSoup(response.content, 'html.parser')

    raceUrls = []
    horseUrls = []
    rowsDates = soup_object.find_all('div', 'RC-alphabetIndexHorseList__race')
    hLinks = soup_object.find_all('div', 'RC-alphabetIndexHorseList__horseName')
    print(len(rowsDates))

    # keep the first anchor in each race row, skipping any link containing 'cagnes'
    for row in rowsDates:
        aux = row.find_all('a', href=True)
        if len(aux) > 0 and 'cagnes' not in aux[0].get('href'):
            raceUrls.append(aux[0].get('href'))

    # same again for the horse-name rows
    for row in hLinks:
        aux = row.find_all('a', href=True)
        if len(aux) > 0 and 'cagnes' not in aux[0].get('href'):
            horseUrls.append(aux[0].get('href'))

    return horseUrls, raceUrls

# run it; starturls is the list of index-page URLs (not included here)
horseUrls, raceUrls = get_linksDay(starturls[0])
print(raceUrls)
print(horseUrls)
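
Feeding the result into the WriteToTxt / CSV module from the list might look something like the sketch below; the function name write_urls_to_csv and the output filename are just illustrative.

# Minimal sketch of a WriteToTxt / CSV module; names are illustrative.
import csv

def write_urls_to_csv(horse_urls, race_urls, path='links.csv'):
    """Write the scraped href links to a two-column CSV file."""
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['type', 'href'])
        for href in race_urls:
            writer.writerow(['race', href])
        for href in horse_urls:
            writer.writerow(['horse', href])

# e.g. write_urls_to_csv(horseUrls, raceUrls)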

B The Problem

The Scrapy way has been to start async and big, with Items and the like. The cleaner way is to start small and plug in modules as needed. The less code, the easier it is to customize and understand later when you come back to it. To be fair, this is less of an issue than it once was.

In the above list of modules, the hardest aspect is custom page scraping. If a client provides the item class, struct or DB schema then it is easier, but mapping XPaths to fields, cleaning fields etc. is time-consuming and, as far as I can tell, non-automatable. At least doing it this way (see the sketch below).
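
To make that concrete, a rough sketch of the mapping when a client does hand over a schema might look like this. The field names, CSS selectors and cleaning functions are all hypothetical; working them out per site is exactly the time-consuming part.

# Hypothetical field map: the fields and selectors are made up for illustration.
FIELD_MAP = {
    'horse_name': ('div.RC-horseName a', lambda s: s.strip()),
    'race_time': ('span.RC-raceTime', lambda s: s.strip()),
    'distance': ('span.RC-distance', lambda s: s.replace('\xa0', ' ').strip()),
}

def scrape_item(soup):
    """Build one item dict from a parsed page (a BeautifulSoup object) using the field map."""
    item = {}
    for field, (selector, clean) in FIELD_MAP.items():
        node = soup.select_one(selector)
        item[field] = clean(node.get_text()) if node else None
    return item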

The main problem, and the reason for this post, is that it is too pricey to ask one person to do all of this when at most 3 of the above tasks really involve any serious brainpower that commands a premium. Also, you want to be trying to do this on a larger, though not industrial, scale. So I am using a single custom module to get data from a whole range of sources.

C What's in it for devs?

You might think some devs would prefer hourly billing so they can bring in their ideal fee, but this is far more transparent. The better devs will simply charge a higher rate per module. This means the easier jobs cost more, but you get to use the same developer for the harder modules.

Final word:

D For the Client / WorkPortal

But this misses the point (I am thinking out loud here). By itemizing work like this you can standardize it. Therefore, you can better compare performance between developers. At the same time you can choose a lower-rated developer for basic tasks and only deploy higher-skilled ones for the skilled tasks. Cheaper overall, and more transparent.