Project 2.1: Web Scraping Basics
Learning Objectives
By the end of this project, you will be able to:
- Understand what web scraping is and its applications
- Install and use BeautifulSoup for HTML parsing
- Extract data from HTML documents
- Navigate HTML structure (tags, attributes, text)
- Handle different HTML elements and structures
- Use requests library to fetch web pages
- Understand Scrapy framework basics
- Practice ethical web scraping
- Respect robots.txt and rate limiting
- Handle dynamic content
- Clean and process scraped data
- Build practical web scrapers
- Debug scraping issues
- Apply best practices for web scraping
Introduction to Web Scraping
Web Scraping is the process of extracting data from websites. It involves fetching web pages and parsing their HTML/XML content to extract specific information.
Common Use Cases:
- Data collection for research
- Price monitoring
- News aggregation
- Social media analysis
- Job listings collection
- Product information extraction
Important Considerations:
- Legal: Always check the website's Terms of Service
- Ethical: Respect robots.txt and rate limits
- Technical: Handle dynamic content, errors, and edge cases
BeautifulSoup
Installation
# Install BeautifulSoup4
pip install beautifulsoup4
# Install lxml parser (faster)
pip install lxml
# Install requests for fetching pages
pip install requests
# Verify installation
python -c "from bs4 import BeautifulSoup; print('OK')"
Basic Usage
from bs4 import BeautifulSoup
import requests
# Fetch a web page
url = 'https://example.com'
response = requests.get(url)
html = response.text
# Parse HTML
soup = BeautifulSoup(html, 'html.parser')
# Find elements
title = soup.find('title')
print(title.text)
Parsing HTML
from bs4 import BeautifulSoup
# Parse from string
html = '<html><body><h1>Hello World</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')
# Parse from file
with open('page.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Different parsers
soup = BeautifulSoup(html, 'html.parser') # Built-in (slower)
soup = BeautifulSoup(html, 'lxml') # Fast, requires lxml
soup = BeautifulSoup(html, 'html5lib') # Most lenient
Finding Elements
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<h1>Title</h1>
<p class="intro">Introduction</p>
<p class="content">Content here</p>
<a href="https://example.com">Link</a>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')
# Find first element
first_p = soup.find('p')
print(first_p.text) # Introduction
# Find all elements
all_p = soup.find_all('p')
for p in all_p:
    print(p.text)
# Find by class
intro = soup.find('p', class_='intro')
print(intro.text)
# Find by id
# element = soup.find(id='myid')
# Find by attribute
link = soup.find('a', href='https://example.com')
print(link['href'])
Navigating HTML Structure
from bs4 import BeautifulSoup
html = '''
<div class="container">
<h1>Title</h1>
<div class="content">
<p>Paragraph 1</p>
<p>Paragraph 2</p>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_='container')
# Parent
parent = container.parent
# Children
for child in container.children:
    print(child)
# Descendants
for descendant in container.descendants:
    print(descendant)
# Siblings
next_sibling = container.next_sibling
prev_sibling = container.previous_sibling
# Find parent
content = soup.find('div', class_='content')
parent = content.find_parent('div', class_='container')
Extracting Data
from bs4 import BeautifulSoup
html = '''
<div class="product">
<h2>Product Name</h2>
<span class="price">$29.99</span>
<p class="description">Product description</p>
<a href="/product/123">View</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
product = soup.find('div', class_='product')
# Extract text
name = product.find('h2').text.strip()
price = product.find('span', class_='price').text.strip()
description = product.find('p', class_='description').text.strip()
# Extract attribute
link = product.find('a')['href']
# Extract all text
all_text = product.get_text(separator=' ', strip=True)
print(f'Name: {name}')
print(f'Price: {price}')
print(f'Description: {description}')
print(f'Link: {link}')
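Scraped values usually need cleaning before they can be analyzed. A small sketch, assuming the price keeps the $XX.XX format shown above, that turns the price string into a number:
# Convert the scraped price text (e.g. '$29.99') into a float
price_value = float(price.replace('$', '').replace(',', ''))
print(price_value)  # 29.99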
Working with Attributes
from bs4 import BeautifulSoup
html = '<a href="https://example.com" class="link" data-id="123">Link</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
# Get attribute
href = link['href']
link_class = link.get('class')
data_id = link.get('data-id', 'default')
# Get all attributes
attrs = link.attrs
print(attrs) # {'href': 'https://example.com', 'class': ['link'], 'data-id': '123'}
# Check if attribute exists
if 'href' in link.attrs:
    print('Has href attribute')
CSS Selectors
from bs4 import BeautifulSoup
html = '''
<div class="container">
<p class="intro">Introduction</p>
<p class="content">Content</p>
<div id="sidebar">
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# Select by CSS selector
intro = soup.select_one('p.intro')
all_p = soup.select('p')
sidebar = soup.select('#sidebar')
items = soup.select('ul li')
# Complex selectors
first_item = soup.select_one('div#sidebar ul li:first-child')
Using requests Library
Fetching Web Pages
import requests
# Basic GET request
url = 'https://example.com'
response = requests.get(url)
# Check status
print(response.status_code) # 200
print(response.ok) # True
# Get content
html = response.text
content = response.content
# Headers
print(response.headers)
# URL
print(response.url)
Handling Errors
import requests
from requests.exceptions import RequestException
url = 'https://example.com'
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raises an exception for bad status codes
    html = response.text
except requests.exceptions.Timeout:
    print('Request timed out')
except requests.exceptions.ConnectionError:
    print('Connection error')
except requests.exceptions.HTTPError as e:
    print(f'HTTP error: {e}')
except RequestException as e:
    print(f'Request error: {e}')
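Transient network failures are common when scraping, so it often pays to retry a failed request a few times before giving up. A minimal sketch; the attempt count and backoff delay here are arbitrary choices, not requests defaults:
import time
import requests

def fetch_with_retries(url, attempts=3):
    """Retry a flaky request a few times, backing off between tries."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f'Attempt {attempt} failed: {e}')
            if attempt < attempts:
                time.sleep(2 * attempt)  # simple linear backoff before retrying
    return None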
Adding Headers
import requests
url = 'https://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
}
response = requests.get(url, headers=headers)
Session Management
import requests
# Create session (maintains cookies)
session = requests.Session()
# Set headers for all requests
session.headers.update({
    'User-Agent': 'MyBot/1.0'
})
# Make requests
response1 = session.get('https://example.com/login')
response2 = session.get('https://example.com/dashboard') # Cookies maintained
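If a site requires signing in before protected pages become visible, the same session can submit the login form first so later requests carry the session cookie. A minimal sketch; the /login endpoint and the username/password field names are hypothetical and depend on the site's actual form:
login_data = {'username': 'myuser', 'password': 'mypassword'}  # hypothetical form field names
login_response = session.post('https://example.com/login', data=login_data)
if login_response.ok:
    dashboard = session.get('https://example.com/dashboard')  # session cookie is sent automatically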
Complete Scraping Example
Simple Web Scraper
#!/usr/bin/env python3
"""
Simple Web Scraper
Scrapes title and links from a webpage
"""
import requests
from bs4 import BeautifulSoup
import time


def scrape_page(url):
    """Scrape a single page."""
    try:
        # Fetch page
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract data
        title = soup.find('title')
        title_text = title.text.strip() if title else 'No title'
        links = []
        for link in soup.find_all('a', href=True):
            links.append({
                'text': link.text.strip(),
                'url': link['href']
            })
        return {
            'title': title_text,
            'links': links,
            'url': url
        }
    except Exception as e:
        print(f'Error scraping {url}: {e}')
        return None


def main():
    url = 'https://example.com'
    data = scrape_page(url)
    if data:
        print(f"Title: {data['title']}")
        print(f"\nFound {len(data['links'])} links:")
        for link in data['links'][:10]:  # Show first 10
            print(f"  {link['text']} -> {link['url']}")


if __name__ == '__main__':
    main()
Advanced Scraper with Data Extraction
#!/usr/bin/env python3
"""
Advanced Web Scraper
Extracts structured data from a webpage
"""
import requests
from bs4 import BeautifulSoup
import json
import csv
from typing import List, Dict, Optional


class WebScraper:
    def __init__(self, base_url: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })

    def fetch_page(self, url: str) -> Optional[BeautifulSoup]:
        """Fetch and parse a webpage; returns None on failure."""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        except Exception as e:
            print(f'Error fetching {url}: {e}')
            return None

    def extract_products(self, soup: BeautifulSoup) -> List[Dict]:
        """Extract product information from page."""
        products = []
        # Find all product containers (adjust selector as needed)
        product_divs = soup.find_all('div', class_='product')
        for div in product_divs:
            try:
                product = {
                    'name': div.find('h2').text.strip() if div.find('h2') else '',
                    'price': div.find('span', class_='price').text.strip() if div.find('span', class_='price') else '',
                    'description': div.find('p', class_='description').text.strip() if div.find('p', class_='description') else '',
                    'link': div.find('a')['href'] if div.find('a') else ''
                }
                products.append(product)
            except Exception as e:
                print(f'Error extracting product: {e}')
                continue
        return products

    def save_to_json(self, data: List[Dict], filename: str):
        """Save data to JSON file."""
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f'Data saved to {filename}')

    def save_to_csv(self, data: List[Dict], filename: str):
        """Save data to CSV file."""
        if not data:
            return
        fieldnames = data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f'Data saved to {filename}')


def main():
    scraper = WebScraper('https://example.com')
    # Fetch page
    soup = scraper.fetch_page('https://example.com/products')
    if soup:
        # Extract data
        products = scraper.extract_products(soup)
        # Save data
        scraper.save_to_json(products, 'products.json')
        scraper.save_to_csv(products, 'products.csv')
        print(f'\nExtracted {len(products)} products')


if __name__ == '__main__':
    main()
Scrapy Introduction
What is Scrapy?
Scrapy is a powerful Python framework for web scraping. Where BeautifulSoup only parses HTML, Scrapy also handles fetching, crawling, scheduling, and exporting data, which makes it better suited for large-scale scraping projects.
Key Features:
- Built-in support for crawling multiple pages
- Automatic request scheduling
- Built-in data export (JSON, CSV, XML)
- Middleware support
- Extensible architecture
Installation
# Install Scrapy
pip install scrapy
# Verify installation
scrapy --version
Basic Scrapy Project
# Create a new Scrapy project
scrapy startproject myproject
# This creates:
# myproject/
#     scrapy.cfg
#     myproject/
#         __init__.py
#         items.py
#         middlewares.py
#         pipelines.py
#         settings.py
#         spiders/
#             __init__.py
Creating a Spider
# myproject/spiders/example_spider.py
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        links = response.css('a::attr(href)').getall()
        yield {
            'title': title,
            'links': links,
            'url': response.url
        }
        # Follow links
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
Running Scrapy
# Run spider
scrapy crawl example
# Save to JSON
scrapy crawl example -o output.json
# Save to CSV
scrapy crawl example -o output.csv
# Override a setting for this run
scrapy crawl example -s DOWNLOAD_DELAY=1
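Settings passed with -s can also be set permanently in the project's settings.py. A minimal sketch of polite-crawling settings; the values are illustrative choices, not recommendations from the Scrapy docs:
# myproject/settings.py (excerpt)
BOT_NAME = 'myproject'
USER_AGENT = 'MyBot/1.0 (+https://example.com/bot-info)'  # identify your scraper; contact URL is a placeholder
ROBOTSTXT_OBEY = True                  # respect robots.txt
DOWNLOAD_DELAY = 1                     # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2     # limit parallel requests per domain
AUTOTHROTTLE_ENABLED = True            # back off automatically when the server slows down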
Scrapy Items
# myproject/items.py
import scrapy


class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    description = scrapy.Field()
    url = scrapy.Field()
Using Items in Spider
import scrapy
from myproject.items import ProductItem


class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product_div in response.css('div.product'):
            item = ProductItem()
            item['name'] = product_div.css('h2::text').get()
            item['price'] = product_div.css('span.price::text').get()
            item['description'] = product_div.css('p.description::text').get()
            item['url'] = response.url
            yield item
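Items yielded by a spider pass through the item pipelines defined in pipelines.py (created by startproject). A minimal sketch that cleans the price field; it assumes the pipeline is activated in settings.py via ITEM_PIPELINES:
# myproject/pipelines.py
class PriceCleanupPipeline:
    """Strip whitespace from the price field before export."""

    def process_item(self, item, spider):
        if item.get('price'):
            item['price'] = item['price'].strip()
        return item

# To activate it, add to settings.py:
# ITEM_PIPELINES = {'myproject.pipelines.PriceCleanupPipeline': 300}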
Ethical Web Scraping
robots.txt
import urllib.robotparser
def check_robots_txt(url):
    """Check if URL is allowed by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f'{url}/robots.txt')
    rp.read()
    user_agent = 'MyBot'
    if rp.can_fetch(user_agent, url):
        print(f'Allowed to scrape {url}')
        return True
    else:
        print(f'Not allowed to scrape {url}')
        return False


# Usage
check_robots_txt('https://example.com')
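Some sites also publish a Crawl-delay directive in robots.txt; RobotFileParser exposes it via crawl_delay(), which makes it a natural input for the rate limiting shown in the next section. A small sketch (crawl_delay() returns None when the site specifies no delay):
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
delay = rp.crawl_delay('MyBot')  # None if robots.txt sets no Crawl-delay for this agent
print(f'Suggested delay: {delay if delay is not None else "none specified"}')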
Rate Limiting
import time
import requests
class RateLimitedScraper:
    def __init__(self, delay=1.0):
        self.delay = delay
        self.last_request_time = 0

    def fetch(self, url):
        """Fetch URL with rate limiting."""
        # Wait if needed
        elapsed = time.time() - self.last_request_time
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        response = requests.get(url)
        self.last_request_time = time.time()
        return response


# Usage
scraper = RateLimitedScraper(delay=2.0)  # 2 seconds between requests
response = scraper.fetch('https://example.com')
Best Practices
#!/usr/bin/env python3
"""
Ethical Web Scraper
Follows best practices for web scraping
"""
import requests
from bs4 import BeautifulSoup
import time
import urllib.robotparser
from urllib.parse import urljoin


class EthicalScraper:
    def __init__(self, base_url, delay=1.0, user_agent='MyBot/1.0'):
        self.base_url = base_url
        self.delay = delay
        self.user_agent = user_agent
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': user_agent})
        self.last_request_time = 0
        self.robots_parser = None
        self._check_robots_txt()

    def _check_robots_txt(self):
        """Check robots.txt."""
        try:
            robots_url = urljoin(self.base_url, '/robots.txt')
            self.robots_parser = urllib.robotparser.RobotFileParser()
            self.robots_parser.set_url(robots_url)
            self.robots_parser.read()
            print(f'Loaded robots.txt from {robots_url}')
        except Exception as e:
            print(f'Could not load robots.txt: {e}')
            self.robots_parser = None

    def can_fetch(self, url):
        """Check if URL can be fetched according to robots.txt."""
        if self.robots_parser:
            return self.robots_parser.can_fetch(self.user_agent, url)
        return True

    def fetch(self, url):
        """Fetch URL with rate limiting and robots.txt check."""
        # Check robots.txt
        if not self.can_fetch(url):
            print(f'Skipping {url} (disallowed by robots.txt)')
            return None
        # Rate limiting
        elapsed = time.time() - self.last_request_time
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            self.last_request_time = time.time()
            return response
        except Exception as e:
            print(f'Error fetching {url}: {e}')
            return None

    def scrape(self, url):
        """Scrape a single page."""
        response = self.fetch(url)
        if not response:
            return None
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract data (customize as needed)
        data = {
            'url': url,
            'title': soup.find('title').text.strip() if soup.find('title') else '',
            'content': soup.get_text(separator=' ', strip=True)[:500]  # First 500 chars
        }
        return data


def main():
    scraper = EthicalScraper('https://example.com', delay=2.0)
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    for url in urls:
        data = scraper.scrape(url)
        if data:
            print(f"Scraped: {data['title']}")
        time.sleep(1)  # Additional delay between pages


if __name__ == '__main__':
    main()
Legal and Ethical Guidelines
- Check Terms of Service: Always read and respect the website's ToS
- Respect robots.txt: Follow robots.txt directives
- Rate Limiting: Don't overload servers with requests
- Identify Yourself: Use proper User-Agent headers
- Don't Scrape Personal Data: Be careful with personal information
- Respect Copyright: Don't republish copyrighted content
- Use APIs When Available: Prefer official APIs over scraping
- Be Transparent: If scraping for research, be transparent about it
Practice Exercise
Exercise: Web Scraper
Objective: Create a web scraper that extracts information from a website.
Requirements:
1. Create a scraper that:
   - Fetches web pages using requests
   - Parses HTML using BeautifulSoup
   - Extracts specific data (e.g., news headlines, product info)
   - Saves data to JSON or CSV
   - Implements rate limiting
   - Checks robots.txt
2. Features:
   - Error handling
   - Rate limiting
   - Data cleaning
   - Export functionality
Example Solution:
#!/usr/bin/env python3
"""
News Headlines Scraper
Scrapes news headlines from a news website
"""
import requests
from bs4 import BeautifulSoup
import json
import csv
import time
import urllib.robotparser
from urllib.parse import urljoin
from typing import List, Dict


class NewsScraper:
    def __init__(self, base_url, delay=2.0):
        self.base_url = base_url
        self.delay = delay
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
        self.last_request_time = 0
        self._check_robots_txt()

    def _check_robots_txt(self):
        """Check robots.txt."""
        try:
            robots_url = urljoin(self.base_url, '/robots.txt')
            self.robots_parser = urllib.robotparser.RobotFileParser()
            self.robots_parser.set_url(robots_url)
            self.robots_parser.read()
            print('✓ Loaded robots.txt')
        except Exception as e:
            print(f'⚠ Could not load robots.txt: {e}')
            self.robots_parser = None

    def can_fetch(self, url):
        """Check if URL is allowed."""
        if self.robots_parser:
            return self.robots_parser.can_fetch('*', url)
        return True

    def _rate_limit(self):
        """Implement rate limiting."""
        elapsed = time.time() - self.last_request_time
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request_time = time.time()

    def fetch_page(self, url):
        """Fetch a webpage."""
        if not self.can_fetch(url):
            print(f'⚠ Skipping {url} (disallowed by robots.txt)')
            return None
        self._rate_limit()
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response
        except Exception as e:
            print(f'✗ Error fetching {url}: {e}')
            return None

    def extract_headlines(self, soup):
        """Extract headlines from page."""
        headlines = []
        # Adjust selectors based on actual website structure
        # This is a generic example
        article_elements = soup.find_all('article') or soup.find_all('div', class_='article')
        for element in article_elements:
            try:
                headline = {
                    'title': '',
                    'link': '',
                    'summary': '',
                    'date': ''
                }
                # Extract title
                title_elem = element.find('h2') or element.find('h3') or element.find('a')
                if title_elem:
                    headline['title'] = title_elem.get_text(strip=True)
                    if title_elem.name == 'a' and title_elem.get('href'):
                        headline['link'] = urljoin(self.base_url, title_elem['href'])
                # Extract summary
                summary_elem = element.find('p') or element.find('div', class_='summary')
                if summary_elem:
                    headline['summary'] = summary_elem.get_text(strip=True)[:200]
                # Extract date
                date_elem = element.find('time') or element.find('span', class_='date')
                if date_elem:
                    headline['date'] = date_elem.get_text(strip=True)
                if headline['title']:
                    headlines.append(headline)
            except Exception as e:
                print(f'Error extracting headline: {e}')
                continue
        return headlines

    def scrape(self, url):
        """Scrape headlines from a page."""
        response = self.fetch_page(url)
        if not response:
            return []
        soup = BeautifulSoup(response.text, 'html.parser')
        headlines = self.extract_headlines(soup)
        return headlines

    def save_json(self, data, filename):
        """Save data to JSON."""
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f'✓ Saved {len(data)} items to {filename}')

    def save_csv(self, data, filename):
        """Save data to CSV."""
        if not data:
            return
        fieldnames = data[0].keys()
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f'✓ Saved {len(data)} items to {filename}')


def main():
    # Example: Scrape from a news website
    # Replace with actual URL
    base_url = 'https://example-news.com'
    scraper = NewsScraper(base_url, delay=2.0)
    # Scrape main page
    print(f'Scraping {base_url}...')
    headlines = scraper.scrape(base_url)
    if headlines:
        print(f'\n✓ Extracted {len(headlines)} headlines')
        # Display first few
        print('\nFirst 5 headlines:')
        for i, headline in enumerate(headlines[:5], 1):
            print(f'\n{i}. {headline["title"]}')
            if headline.get('summary'):
                print(f'   {headline["summary"]}')
        # Save data
        scraper.save_json(headlines, 'headlines.json')
        scraper.save_csv(headlines, 'headlines.csv')
    else:
        print('No headlines found')


if __name__ == '__main__':
    main()
Expected Output: A complete web scraper that extracts headlines and saves them to JSON and CSV files.
Challenge (Optional):
- Scrape multiple pages
- Handle pagination
- Extract images
- Handle dynamic content with Selenium (see the sketch after this list)
- Add data validation
- Create a CLI interface
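One way to approach the dynamic-content challenge is to let a real browser execute the page's JavaScript and then parse the rendered HTML as usual. A minimal sketch using Selenium; it assumes Chrome and a matching driver are installed, and the URL is a placeholder:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('--headless')  # run without opening a browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com/dynamic-page')  # placeholder URL
    time.sleep(3)  # crude wait for scripts to finish; WebDriverWait is the more robust option
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    title = soup.find('title')
    print(title.text if title else 'No title')
finally:
    driver.quit()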
Key Takeaways
- Web Scraping - Extracting data from websites
- BeautifulSoup - HTML parsing library
- requests - HTTP library for fetching pages
- HTML Navigation - Finding and extracting elements
- CSS Selectors - Powerful element selection
- Scrapy - Advanced scraping framework
- Ethical Scraping - Respect robots.txt and rate limits
- Rate Limiting - Don't overload servers
- Error Handling - Handle network and parsing errors
- Data Export - Save to JSON, CSV, etc.
- Best Practices - Legal, ethical, technical considerations
- Session Management - Maintain cookies and headers
- Data Cleaning - Process and clean scraped data
- Robots.txt - Check before scraping
- User-Agent - Identify your scraper properly
Quiz: Web Scraping
Test your understanding with these questions:
1. What library is commonly used for HTML parsing?
   - A) requests
   - B) BeautifulSoup
   - C) urllib
   - D) scrapy
2. What fetches web pages?
   - A) requests.get()
   - B) BeautifulSoup()
   - C) urllib.fetch()
   - D) scrapy.get()
3. What finds the first element?
   - A) soup.find()
   - B) soup.find_all()
   - C) soup.select()
   - D) soup.get()
4. What finds all elements?
   - A) soup.find()
   - B) soup.find_all()
   - C) soup.select_all()
   - D) soup.get_all()
5. What extracts text from an element?
   - A) element.text
   - B) element.get_text()
   - C) element.string
   - D) All of the above
6. What is robots.txt?
   - A) File that allows/disallows scraping
   - B) File with robot instructions
   - C) Scraping rules file
   - D) All of the above
7. What is rate limiting?
   - A) Limiting request speed
   - B) Delaying between requests
   - C) Respecting server resources
   - D) All of the above
8. What is Scrapy?
   - A) HTML parser
   - B) Web scraping framework
   - C) HTTP library
   - D) Browser automation
9. What should you check before scraping?
   - A) robots.txt
   - B) Terms of Service
   - C) Rate limits
   - D) All of the above
10. What is ethical scraping?
   - A) Following robots.txt
   - B) Rate limiting
   - C) Respecting ToS
   - D) All of the above
Answers:
1. B) BeautifulSoup (HTML parsing library)
2. A) requests.get() (fetch web pages)
3. A) soup.find() (find first element)
4. B) soup.find_all() (find all elements)
5. D) All of the above (text extraction methods)
6. D) All of the above (robots.txt purpose)
7. D) All of the above (rate limiting)
8. B) Web scraping framework (Scrapy)
9. D) All of the above (pre-scraping checks)
10. D) All of the above (ethical scraping)
Next Steps
Excellent work! You've mastered web scraping basics. You now understand:
- BeautifulSoup for HTML parsing
- Scrapy introduction
- Ethical scraping practices
- How to build web scrapers
What's Next?
- Project 3.1: SQLite with Python
- Learn database operations
- Work with SQLite
- Build database applications
Additional Resources
- BeautifulSoup Documentation: www.crummy.com/software/BeautifulSoup/bs4/doc/
- Scrapy Documentation: docs.scrapy.org/
- requests Documentation: docs.python-requests.org/
- Web Scraping Ethics: Always respect website terms and robots.txt
Project completed! You're ready to move on to the next project.