
Dagster Data Engineering Glossary:


Data Scraping

Extract data from a website or another source.

Data Scraping definition:

Data Scraping refers to the process of extracting data from a website or another source, typically to transform and store it in a structured form. While the concept has existed for a long time, it's become increasingly relevant in the context of modern data engineering due to the vast amounts of data available on the web and the value that can be derived from analyzing that data.

Is data scraping legitimate, or just a workaround?

Many engineers would argue that the term "scraping" implies a workaround used when no formal method for making the data available exists, such as an API (Application Programming Interface) or some form of export or download. Whether data scraping is a workaround or a legitimate approach to data acquisition depends on the context, the specific use case, and, often, the legal or ethical considerations involved:

Legitimate Uses:

  • Research: Academics and researchers might scrape data from the web to conduct research, analyze trends, or gather datasets that are not otherwise available.
  • Business Intelligence: Companies might scrape data to analyze market trends, track competitors' prices, or gather information that is publicly available but not readily accessible in structured formats.
  • Intelligent Process Automation: Engineers may devise ways of scraping data from unformatted or legacy documents that were never designed for digital processing.

Workaround Uses:

  • API Limitations: Sometimes, websites offer an API to access their data, but there might be limitations regarding the rate of requests, the volume of data, or the type of data available. In these cases, scraping might be seen as a workaround to bypass such limitations.
  • No API Availability: If a website doesn't provide an API or structured data access, scraping may be the only way to get the data programmatically. Again, check the website's terms and conditions to see if scraping is a violation.

Is data scraping only good for web content?

Data scraping is often thought to be synonymous with "web scraping," but it is not limited to just web content. While web scraping is one of the most talked-about forms of data scraping due to the vast amount of publicly accessible data on the internet, scraping as a concept is much broader and can be applied wherever there's data that needs to be extracted. Here are some other contexts where data scraping is applicable:

  1. File Systems: Data can be scraped from a collection of documents, spreadsheets, and other files. For instance, you might scrape text from a collection of PDFs or pull specific data from Excel files (see the PDF sketch after this list).

  2. Databases: You can "scrape" or extract data from databases, although the term more commonly used in this context is "query" or "extract."

  3. Emails: Data can be scraped from email bodies, attachments, or headers for various purposes like surveillance, analytics, or automation.

  4. Images and Videos: With the help of OCR (Optical Character Recognition) tools, text can be scraped from images. Similarly, specific data or patterns can be extracted from videos.

  5. Software Applications: Data can be scraped from software applications that don't offer an export feature but display relevant data on their GUI. Tools that simulate mouse and keyboard actions (like RPA - Robotic Process Automation tools) can be used in such scenarios.

  6. Network Traffic: In cybersecurity and network management, data packets can be "scraped" or captured to analyze the content of network traffic.
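
As a rough illustration of the file-system case, here is a minimal sketch that walks a local folder of PDFs and pulls the raw text out of each one. It assumes the third-party pypdf package is installed (pip install pypdf), and the "invoices" folder name is purely a hypothetical placeholder.

import pathlib
from pypdf import PdfReader  # assumes `pip install pypdf`

def scrape_pdf_text(folder):
    # Yield a (filename, text) pair for every PDF in the folder
    for pdf_path in pathlib.Path(folder).glob("*.pdf"):
        reader = PdfReader(pdf_path)
        # Concatenate the extracted text of every page in the document
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        yield pdf_path.name, text

# Example usage ("invoices" is a hypothetical folder name):
for name, text in scrape_pdf_text("invoices"):
    print(name, len(text), "characters extracted")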

Data scraping use cases:

You will find data scraping steps in the data pipelines used in many different scenarios:

  • Market Research: Scraping product prices from various online retailers for competitive analysis, as mentioned earlier.
  • Social Media Analysis: Extracting user reviews, comments, and ratings for sentiment analysis.
  • Job Boards: Scraping job postings to analyze labor market trends.
  • Real Estate: Gathering property listings and prices for market trend analysis.

The challenges of data scraping as a technique:

  • Legal Concerns: Not all websites allow data scraping. It's essential to review the website’s robots.txt file and the terms of service to ensure compliance.
  • Dynamic Content: Websites using JavaScript to load content can be challenging to scrape using basic methods. Frameworks that can mimic browser behavior, like Selenium, are used in such cases (see the sketch after this list).
  • Rate Limiting: Websites might limit the frequency of access to prevent scraping. Designing respectful and non-intrusive scrapers is essential.
  • Data Quality: Scraped data can be messy and might require considerable cleaning and preprocessing.
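
To make the dynamic-content point concrete, here is a minimal, hedged sketch of driving a headless browser with Selenium. It assumes Selenium 4 with a local Chrome installation, and uses example.com purely as a placeholder URL; the elements you extract would depend on the actual page.

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    time.sleep(2)  # crude wait for JavaScript-rendered content to appear
    # Read the text of every <h1> element the browser has rendered
    for heading in driver.find_elements(By.TAG_NAME, "h1"):
        print(heading.text)
finally:
    driver.quit()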

Ethics & Best Practices in data scraping:

  • Respect robots.txt: This is a standard used by websites to communicate which parts should not be scraped or crawled.
  • Avoid Overloading Servers: Introduce delays between requests to avoid unintentionally performing a Denial-of-Service attack.
  • User-Agent Headers: Be transparent about who you are and why you are scraping. Some websites block default user-agent strings used by scraping libraries. A sketch combining these practices follows this list.
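
The sketch below combines these practices: it checks robots.txt with Python's standard urllib.robotparser, identifies itself with a custom User-Agent header, and pauses between requests. The user-agent string, contact address, and URLs are placeholders, not values from any real site.

import time
import requests
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-research-bot/0.1 (contact: you@example.com)"  # placeholder identity

# Fetch and parse robots.txt before crawling (example.com is a placeholder domain)
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder URLs

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):
        print(f"Skipping {url}: disallowed by robots.txt")
        continue
    response = requests.get(url, headers={"User-Agent": USER_AGENT})
    print(url, response.status_code)
    time.sleep(2)  # pause between requests to avoid overloading the server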

In the broader context of data engineering, data scraping is just one component. Once data is acquired, it has to be cleaned, transformed, stored, and finally analyzed. As technology and the web evolve, data scraping techniques and tools will also adapt, and many technology companies are growing increasingly protective of their web content.

Python tools for data scraping

Python is a tool of choice for building data scraping applications, and it is supported by a broad ecosystem of off-the-shelf libraries and services:

  • Libraries: Python libraries such as Beautiful Soup, Scrapy, and Selenium are popular choices for web scraping (a small Scrapy sketch follows this list).
  • Cloud Services: AWS Lambda and Google Cloud Functions can be used to execute scraping tasks at scale.
  • Scraping Platforms: Tools like Octoparse, Import.io, and WebHarvy offer user-friendly GUIs for web scraping without coding.
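
As a quick taste of the library route, here is a minimal Scrapy spider sketched against quotes.toscrape.com, a public sandbox site intended for scraping practice; the CSS selectors reflect that site's markup and would need to be adapted to any other page.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits inside a <div class="quote"> element
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Saved as quotes_spider.py, this can be run without setting up a full Scrapy project using scrapy runspider quotes_spider.py -o quotes.json.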

A practical example of data scraping in Python

By way of illustration, let's scrape real-time weather data from the wttr.in website, which provides weather information in a simple text format. It's also developer-friendly, making it a suitable candidate for this example.

Note: While the "wttr.in" service is developer-friendly, try not to make too many requests to any service, as it may lead to your IP being rate-limited or banned. Always read and follow the terms of service or usage policies of any external service you interact with programmatically.

To do the scraping we only need the requests library: wttr.in can return its data as plain text, so there is no HTML to parse. For ordinary HTML pages you would typically reach for BeautifulSoup, a Python library for pulling data out of HTML and XML files; a short BeautifulSoup sketch follows the weather example below.


Please note that you need to have the necessary libraries installed in your Python environment to run the code below:

pip install beautifulsoup4 requests

Then run the following code:

import requests

BASE_URL = "https://wttr.in/"

def fetch_weather(city):
    # Construct the URL based on the desired city
    # (%C is the weather condition, %t the temperature; the "+" stands for a space)
    url = f"{BASE_URL}{city}?format=%C+%t"

    # Make the request
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for HTTP errors

    # The response is already plain text, e.g. "Overcast +90°F",
    # so no HTML parsing is needed: split on the last space
    condition, temperature = response.text.strip().rsplit(" ", 1)

    return {
        "city": city,
        "condition": condition.strip(),
        "temperature": temperature.lstrip("+")
    }

# Example Usage:
cities = ["Philadelphia","San Francisco","London","Paris"]
for city in cities:
    weather_info = fetch_weather(city)
    print(weather_info)

When you run the above code, you might get an output similar to:

{'city': 'Philadelphia', 'condition': 'Overcast', 'temperature': '90°F'}
{'city': 'San Francisco', 'condition': 'Partly cloudy', 'temperature': '69°F'}
{'city': 'London', 'condition': 'Overcast', 'temperature': '66°F'}
{'city': 'Paris', 'condition': 'Clear', 'temperature': '70°F'}

Yes, it is very hot right now in Philadelphia!
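
As promised above, here is what the HTML-parsing route looks like with BeautifulSoup. It is a minimal sketch that fetches a page and lists its title and links; example.com is a placeholder, and the tags you pull out would depend on the page you are actually allowed to scrape.

import requests
from bs4 import BeautifulSoup  # installed earlier via beautifulsoup4

# example.com is a placeholder; substitute a page whose terms permit scraping
response = requests.get("https://example.com")
response.raise_for_status()

# Parse the HTML and pull out the page title and every hyperlink
soup = BeautifulSoup(response.text, "html.parser")
print("Page title:", soup.title.string if soup.title else "n/a")

for link in soup.find_all("a"):
    print(link.get_text(strip=True), "->", link.get("href"))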

