Python in Digital Scholarship

This guide provides an introduction to using Python in research and instruction and describes the resources available in the Freedman Center.

Web Scraping with Python

Web scraping refers to the automated retrieval and extraction of data from websites. In academic contexts, it is often used to collect publicly available information for research, such as news articles, policy documents, or metadata from institutional repositories. Python is particularly well-suited for web scraping due to its readability and the availability of robust libraries.

Common Use Cases

  • Collecting data for digital humanities projects (e.g., scraping newspaper archives)

  • Aggregating open government datasets

  • Extracting metadata from online catalogs or repositories

  • Compiling text corpora for NLP or linguistic analysis

  • Monitoring changes to online content over time

Key Libraries

  • requests – A simple HTTP library for sending GET/POST requests and retrieving webpage content.

  • BeautifulSoup (bs4) – Parses HTML or XML documents and enables easy navigation and extraction of data using tag-based queries.

  • lxml – A fast HTML/XML parser (optional backend for BeautifulSoup).

  • Selenium – Automates web browsers to extract content from JavaScript-heavy sites (requires a web driver).

  • Scrapy – A full-featured web crawling and scraping framework for large-scale projects.
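As a quick illustration of BeautifulSoup's tag-based queries, the sketch below parses a small inline HTML snippet (a stand-in for a downloaded page, so no network access is needed):

```python
from bs4 import BeautifulSoup

# A small inline HTML snippet standing in for a downloaded page
html = """
<ul>
  <li class="book">Frankenstein</li>
  <li class="book">Dracula</li>
  <li class="journal">Nature</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all matches by tag name and, optionally, attributes such as class
books = [li.get_text() for li in soup.find_all("li", class_="book")]
print(books)  # → ['Frankenstein', 'Dracula']
```

Note that `class_` is spelled with a trailing underscore because `class` is a reserved word in Python.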

Basic Workflow

  1. Send a request to the webpage and retrieve the HTML content.

  2. Parse the HTML using a parser like BeautifulSoup.

  3. Locate and extract the target data using HTML tags, classes, or IDs.

  4. Store the data (e.g., in CSV, JSON, or a database).

Example Code

import requests
from bs4 import BeautifulSoup

# Step 1: Fetch the webpage
url = "https://books.toscrape.com/"
response = requests.get(url)
html = response.text

# Step 2: Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 3: Extract data
# On books.toscrape.com, each book's title is stored in the title
# attribute of the link inside an <h3> tag
for h3 in soup.find_all("h3"):
    print(h3.a["title"])

This script retrieves the HTML from books.toscrape.com, parses it, and prints the title attribute of the link inside each &lt;h3&gt; tag, one per book listed on the page.
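Step 4 of the workflow, storing the results, can be sketched with the standard library's csv module. The titles below are placeholders standing in for the values extracted in Step 3:

```python
import csv

# Placeholder results standing in for titles extracted in Step 3
titles = ["A Light in the Attic", "Tipping the Velvet"]

# Write a header row, then one row per title
with open("titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    for t in titles:
        writer.writerow([t])
```

The same approach extends to multiple columns (e.g., title, price, URL) by writing longer rows; for nested data, the json module is often a better fit.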

Best Practices

  • Respect robots.txt: Check https://example.com/robots.txt to see what the site permits for scraping.

  • Avoid overloading servers: Use time delays between requests (time.sleep()) and limit frequency.

  • Cite scraped data: Clearly document the source and date of data collection for reproducibility.

  • Use APIs when available: Many sites offer structured data via APIs, which are preferable to scraping when accessible.
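The first two practices can be sketched with the standard library: urllib.robotparser checks whether a URL may be fetched, and time.sleep spaces out requests. Here the robots.txt rules are supplied inline rather than fetched from a live site, and the URLs are made up for illustration:

```python
import time
import urllib.robotparser

# A made-up robots.txt, parsed directly instead of fetched from a site
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

urls = [
    "https://example.com/catalog/",
    "https://example.com/private/records",
]

for url in urls:
    if rp.can_fetch("*", url):
        print("allowed:", url)
        # fetch the page here, then pause before the next request
        time.sleep(1)
    else:
        print("skipped:", url)
```

In a real scraper, you would load the rules with rp.set_url("https://example.com/robots.txt") followed by rp.read(), and choose a delay appropriate to the site.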