GUARDLABS
GuardLabs · Technical note

Python Script to Download All Images from a Website

This guide provides a concise Python script that downloads the images referenced on a single web page. It covers fetching the page's HTML, parsing it for image URLs, and saving the files locally. The script does not follow links to other pages of the site.

Prerequisites

Ensure you have Python installed (version 3.6+ recommended). The script relies on the following libraries:

  • requests: For making HTTP requests to fetch webpage content and images.
  • BeautifulSoup4 (bs4): For parsing HTML and extracting image URLs.
  • os (standard library): For creating directories and manipulating file paths.
  • urllib.parse (standard library): For parsing URLs and resolving relative image URLs into absolute ones.

Only requests and BeautifulSoup4 are third-party packages. Install them using pip if you haven't already:

pip install requests beautifulsoup4

Core Script Implementation

The following Python script defines a function to download images. You will need to specify the target URL and a local directory for saving the downloaded files.

import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin, urlparse

def download_images_from_website(url, output_dir="downloaded_images"):
    """
    Downloads all images from a given URL to a specified local directory.

    Args:
        url (str): The URL of the website to scrape.
        output_dir (str): The local directory to save images.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Created directory: {output_dir}")

    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL {url}: {e}")
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    img_tags = soup.find_all('img')

    downloaded_count = 0
    for img in img_tags:
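        # Prefer 'src', but fall back to 'data-src' when 'src' is missing or is
        # an inline data: URI placeholder (common with lazy loading)
        img_url = img.get('src')
        if not img_url or img_url.startswith('data:'):
            img_url = img.get('data-src')

        # Skip tags with no URL and inline data: URIs, which cannot be fetched over HTTP
        if not img_url or img_url.startswith('data:'):
            continue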

        # Construct absolute URL
        absolute_img_url = urljoin(url, img_url)
        
        # Extract filename from URL
        parsed_url = urlparse(absolute_img_url)
        image_name = os.path.basename(parsed_url.path)

        # Fall back to a generated name when the URL path has no usable filename
        if not image_name or '.' not in image_name:
            # The .jpg extension is an assumption; the real image type is not checked
            image_name = f"image_{downloaded_count}.jpg"

        # Ensure unique filenames in case of duplicates
        base, ext = os.path.splitext(image_name)
        counter = 0
        final_image_path = os.path.join(output_dir, image_name)
        while os.path.exists(final_image_path):
            counter += 1
            final_image_path = os.path.join(output_dir, f"{base}_{counter}{ext}")

        try:
            img_response = requests.get(absolute_img_url, stream=True, timeout=10)
            img_response.raise_for_status()

            with open(final_image_path, 'wb') as f:
                for chunk in img_response.iter_content(chunk_size=8192):
                    f.write(chunk)
            print(f"Downloaded: {absolute_img_url} to {final_image_path}")
            downloaded_count += 1
        except requests.exceptions.RequestException as e:
            print(f"Error downloading image {absolute_img_url}: {e}")
        except OSError as e:
            print(f"Error saving image to {final_image_path}: {e}")

    print(f"\nFinished downloading. Total images downloaded: {downloaded_count}")

# --- Example Usage ---
if __name__ == "__main__":
    target_url = "https://www.example.com" # Replace with your target website
    output_directory = "my_website_images"
    download_images_from_website(target_url, output_directory)
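
To make the URL handling in the script easier to follow, here is a small standalone sketch of the two steps it performs for each image: resolving the src value against the page URL, then taking the last path segment as the filename. The helper name resolve_image and the sample URLs are hypothetical and exist only for illustration.

```python
import os
from urllib.parse import urljoin, urlparse

def resolve_image(page_url, src):
    """Return (absolute_url, filename) for an <img> src found on page_url."""
    # urljoin handles relative paths, root-relative paths, and //host/ URLs
    absolute_url = urljoin(page_url, src)
    # Only the path component is used, so query strings never reach the filename
    filename = os.path.basename(urlparse(absolute_url).path)
    return absolute_url, filename

# Hypothetical page and src values, chosen to illustrate each case
PAGE = "https://example.com/blog/post.html"
EXAMPLES = ["../img/logo.png?v=2", "//cdn.example.com/a/b.webp", "/media/"]

for src in EXAMPLES:
    print(src, "->", resolve_image(PAGE, src))
```

The last example resolves to a path ending in a slash, so the filename comes back empty. That is the situation in which the main script substitutes its generated fallback name.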

Important Considerations

Respect robots.txt

Always check a website's robots.txt file (e.g., https://example.com/robots.txt) before scraping. This file outlines the website's policies regarding automated access and indicates which parts of the site should not be crawled. Adhering to these guidelines is crucial for ethical scraping.
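
The check can also be automated. The sketch below uses urllib.robotparser from Python's standard library; the helper names load_robots and is_allowed, and the sample rules, are assumptions made for illustration. It is one possible approach, not part of the original script.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def load_robots(site_url):
    """Fetch and parse robots.txt for site_url (performs a network request)."""
    parser = RobotFileParser()
    parser.set_url(urljoin(site_url, "/robots.txt"))
    parser.read()
    return parser

def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

# Offline demonstration with hypothetical rules instead of a live fetch
demo = RobotFileParser()
demo.parse(["User-agent: *", "Disallow: /private/"])
print(is_allowed(demo, "https://example.com/images/a.png"))
print(is_allowed(demo, "https://example.com/private/b.png"))
```

In the downloader, one option would be to call load_robots(url) once at the start and skip any image URL for which is_allowed returns False.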

Rate Limiting

Aggressive scraping can lead to your IP address being temporarily or permanently blocked by the website's server. To avoid this, implement delays (e.g., time.sleep() from the time module) between requests to mimic human browsing behavior and reduce server load. For example, add import time and then time.sleep(1) after each image download.
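
As a minimal sketch of that idea, the helper below pauses between consecutive requests. The name fetch_with_delay and its parameters are hypothetical; the fetch argument is any callable that downloads one URL, and sleep is injectable so the pacing can be checked without actually waiting.

```python
import time

def fetch_with_delay(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls."""
    results = []
    for index, url in enumerate(urls):
        # No pause before the first request, only between consecutive ones
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the main script's imports in place, a call might look like fetch_with_delay(image_urls, lambda u: requests.get(u, timeout=10)), where image_urls is a list you have collected beforehand. A one-second delay is only a starting point; some sites state a preferred crawl delay in robots.txt.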

Dynamic Content

This script only sees images that appear in <img> tags in the initial HTML response. It checks data-src as a fallback, which covers some lazy-loading setups, but websites that inject images via JavaScript (e.g., single-page applications) may not include those URLs in the initial HTML at all. For such cases, a browser-automation tool such as Selenium, which renders the page in a real browser before parsing, would be required.

Alternative Image Sources

Images can also be embedded via CSS background properties, <picture>/<source> tags, or srcset attributes for responsive images. The provided script focuses on standard <img src="..."> and data-src attributes. Expanding to capture images from these other sources would require additional parsing logic and potentially more complex URL extraction.
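
As an illustration of the extra parsing involved, here is a rough sketch for srcset values. The function name parse_srcset is hypothetical, and the comma split is a simplification: it assumes the image URLs themselves contain no commas, which does not hold for every site.

```python
from urllib.parse import urljoin

def parse_srcset(srcset, base_url):
    """Split a srcset value into (absolute_url, descriptor) pairs."""
    candidates = []
    for part in srcset.split(","):
        tokens = part.strip().split()
        if not tokens:
            continue  # Ignore empty segments such as a trailing comma
        url = urljoin(base_url, tokens[0])
        # A candidate without a descriptor is treated as 1x density
        descriptor = tokens[1] if len(tokens) > 1 else "1x"
        candidates.append((url, descriptor))
    return candidates

# Hypothetical srcset value and page URL
print(parse_srcset("small.jpg 480w, /img/large.jpg 1200w",
                   "https://example.com/gallery/"))
```

The same function could be applied to the srcset attribute of both <img> and <source> tags. Images referenced from CSS would need a separate approach, since they do not appear in the HTML attributes at all.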

Error Handling

The provided script includes basic error handling for HTTP requests and file operations. For robust production-level scripts, consider more comprehensive error management, including retries for transient network issues, logging, and more detailed reporting on failed downloads.
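
One way to add retries and logging is sketched below, as a generic wrapper with exponential backoff. The names call_with_retries and image_downloader, and the default values, are assumptions for illustration. The wrapper is deliberately independent of requests, so the exception types to retry on must be passed in.

```python
import logging
import time

logger = logging.getLogger("image_downloader")  # Hypothetical logger name

def call_with_retries(func, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %s", attempts, exc)
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            wait = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, wait)
            sleep(wait)
```

In the downloader, the image request could be wrapped as call_with_retries(lambda: requests.get(absolute_img_url, stream=True, timeout=10), retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)). Restricting retry_on to transient errors avoids repeating requests that fail for permanent reasons, such as an invalid URL.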

This script provides a foundational approach. Adaptations and further development may be necessary based on the specific website's structure, policies, and your project requirements.


Published 2026-06-23