Python in Digital Scholarship

This guide provides an introduction to using Python in research and instruction and describes the resources available in the Freedman Center.

Web Scraping with Python

Web scraping refers to the automated retrieval and extraction of data from websites. In academic contexts, it is often used to collect publicly available information for research, such as news articles, policy documents, or metadata from institutional repositories. Python is particularly well-suited for web scraping due to its readability and the availability of robust libraries.

Common Use Cases

  • Collecting data for digital humanities projects (e.g., scraping newspaper archives)

  • Aggregating open government datasets

  • Extracting metadata from online catalogs or repositories

  • Compiling text corpora for NLP or linguistic analysis

  • Monitoring changes to online content over time

Key Libraries

  • requests – A simple HTTP library for sending GET/POST requests and retrieving webpage content.

  • BeautifulSoup (bs4) – Parses HTML or XML documents and enables easy navigation and extraction of data using tag-based queries.

  • lxml – A fast HTML/XML parser (optional backend for BeautifulSoup).

  • Selenium – Automates web browsers to extract content from JavaScript-heavy sites (requires a web driver).

  • Scrapy – A full-featured web crawling and scraping framework for large-scale projects.
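As a quick illustration of BeautifulSoup's tag-based queries, the sketch below parses a small inline HTML snippet (a stand-in for a downloaded page, so no network access is needed):

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a downloaded page
html = """
<ul>
  <li class="book">Frankenstein</li>
  <li class="book">Dracula</li>
  <li class="journal">Nature</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all matches by tag name and, optionally, attributes such as class
books = [li.get_text() for li in soup.find_all("li", class_="book")]
print(books)  # → ['Frankenstein', 'Dracula']
```

Note that `class_` is spelled with a trailing underscore because `class` is a reserved word in Python.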

Basic Workflow

  1. Send a request to the webpage and retrieve the HTML content.

  2. Parse the HTML using a parser like BeautifulSoup.

  3. Locate and extract the target data using HTML tags, classes, or IDs.

  4. Store the data (e.g., in CSV, JSON, or a database).

Example Code

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the webpage
url = "https://books.toscrape.com/"
response = requests.get(url)
html = response.text

# Step 2: Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 3: Extract data
# On books.toscrape.com, each book's title is stored in the title
# attribute of the link inside an <h3> tag
for h3 in soup.find_all("h3"):
    print(h3.a["title"])

This script retrieves the HTML from books.toscrape.com, parses it, and prints the title attribute of the link inside each &lt;h3&gt; tag, one per book listed on the page.
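Step 4 of the workflow, storing the results, can be sketched with the standard library's csv module. The titles below are placeholders standing in for the values extracted in Step 3:

```python
import csv

# Placeholder results standing in for titles extracted in Step 3
titles = ["A Light in the Attic", "Tipping the Velvet"]

# Write a header row, then one row per title
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for t in titles:
        writer.writerow([t])
```

The same approach extends to multiple columns (e.g., title, price, URL) by writing longer rows; for nested data, the json module is often a better fit.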

Best Practices

  • Respect robots.txt: Check https://example.com/robots.txt to see what the site permits for scraping.

  • Avoid overloading servers: Use time delays between requests (time.sleep()) and limit frequency.

  • Cite scraped data: Clearly document the source and date of data collection for reproducibility.

  • Use APIs when available: Many sites offer structured data via APIs, which are preferable to scraping when accessible.
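The first two practices can be sketched with the standard library: urllib.robotparser checks whether a URL may be fetched, and time.sleep spaces out requests. Here the robots.txt rules are supplied inline rather than fetched from a live site, and the URLs are made up for illustration:

```python
import time
import urllib.robotparser

# A made-up robots.txt, parsed directly instead of fetched from a site
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

urls = [
    "https://example.com/catalog/",
    "https://example.com/private/records",
]

for url in urls:
    if rp.can_fetch("*", url):
        print("allowed:", url)
        # fetch the page here, then pause before the next request
        time.sleep(1)
    else:
        print("skipped:", url)
```

In a real scraper, you would load the rules with rp.set_url("https://example.com/robots.txt") followed by rp.read(), and choose a delay appropriate to the site.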