Speed up web scraping by blocking unnecessary resources

When scraping a website, you might wonder if blocking resources could speed things up. Running your scraper in headless mode skips rendering visual elements like CSS, images, and animations, making scraping faster. However, there's a catch: headless mode often triggers anti-bot measures like Cloudflare or CAPTCHAs, which can stop your scraper in its tracks.
The idea of blocking resources is still valuable. You reduce bandwidth usage and improve loading times by skipping unnecessary assets such as images, videos, or ads. But is there a way to achieve this without relying solely on headless mode? Thankfully, yes!
With Axiom.ai, you can use the JavaScript step to run JS directly in the browser or app. Puppeteer is a built-in dependency, meaning you can leverage its library of functions to optimize and control the browser.
As a no-coder, I had never tried blocking resources, let alone with code. However, with modern generative AI tools, no-coders like me can now generate working code, boosting our skills while having fun. So, I'm going to give it a go!
In this guide, I'll explore:
- Using simple JavaScript and Puppeteer commands.
- Blocking specific resources like images, videos, and JavaScript.
The goal? To learn handy new techniques to speed up web scraping while keeping it simple. We want to use as few lines of code as possible, so it's easy for no-coders like myself to copy and paste.
Have I used Puppeteer before? Nope, never!
# What resources should you intercept?
To speed up your scraping, consider intercepting the following resources. Videos, adverts, and third-party tracking scripts are often the most significant contributors to slow performance.
- Videos: These are bandwidth-heavy and often unnecessary for scraping.
- Adverts: Advertisements slow down loading times and unnecessarily consume resources.
- Images: Images can be blocked if you need text-based content.
- CSS: You don’t need stylesheets for data scraping, but can you block them without causing issues?
- Third-party tracking apps: Analytics and tracking scripts can be safely blocked.
- JavaScript: Disabling JavaScript sometimes improves speed, but beware, many sites rely on JavaScript for core functionality.
- Fonts: Custom fonts add unnecessary weight to page loads and can be safely blocked.
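Puppeteer reports each request with a resource type string, so the list above maps quite naturally onto a blocklist. Here's a rough sketch of that mapping; the tier names and the `shouldBlock` helper are my own invention, but the type strings (`'image'`, `'media'`, `'font'`, `'stylesheet'`, `'script'`) are the ones Puppeteer's `request.resourceType()` returns, with `'media'` covering both video and audio:

```javascript
// Resource types that are usually safe to block when scraping text content.
const safeToBlock = new Set(['image', 'media', 'font']);

// Riskier types: blocking these can break the page you are scraping.
const riskyToBlock = new Set(['stylesheet', 'script']);

// Decide whether a request should be blocked, given its resource type.
function shouldBlock(resourceType, alsoBlockRisky = false) {
  if (safeToBlock.has(resourceType)) return true;
  if (alsoBlockRisky && riskyToBlock.has(resourceType)) return true;
  return false;
}
```

Inside a `page.on('request', ...)` handler, you would call `shouldBlock(request.resourceType())` and abort or continue the request accordingly, as the examples later in this post do.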
# How do you identify resources to block?
There are a few methods and tools you can use to identify which resources to block. To keep it simple, use Chrome DevTools to examine resources such as image URLs, scripts, and request headers.
To get started:
- Open Chrome DevTools: right-click on the webpage and select Inspect, or use the shortcut Ctrl + Shift + I (Windows/Linux) or Cmd + Option + I (Mac).
- Navigate to the Network tab: this shows all the resources the browser is loading, including images, CSS, JavaScript, and third-party scripts.
- Filter by resource type: use the filter options (e.g., "Images," "Scripts," "Media") to find the specific assets you might want to block.
- Analyze resource sizes: check the "Size" column to identify large files like videos or unnecessary scripts that might slow down your scraper.
If you're new to Chrome DevTools, plenty of tutorials are available online to help you get familiar with its features.
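If you prefer code to clicking around, the tally you'd read off the "Size" column can also be sketched as a tiny helper. This is a hypothetical example: the `records` array stands in for the entries you would collect from the Network tab (or from Puppeteer's response events), and the byte counts are made up for illustration.

```javascript
// Hypothetical sample of network entries, like rows in the DevTools Network tab.
const records = [
  { type: 'image', bytes: 120000 },
  { type: 'image', bytes: 80000 },
  { type: 'media', bytes: 2500000 },
  { type: 'script', bytes: 300000 },
];

// Sum transfer sizes per resource type, largest first.
function sizeByType(entries) {
  const totals = {};
  for (const { type, bytes } of entries) {
    totals[type] = (totals[type] || 0) + bytes;
  }
  return Object.entries(totals).sort((a, b) => b[1] - a[1]);
}

console.log(sizeByType(records));
// In this sample, media dominates, so it is the first candidate to block.
```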
# Can I break something?
Yes, blocking certain resources can break a website's functionality. For example, blocking critical dependencies like JavaScript files or APIs can prevent the site from working properly. This could result in incomplete or missing data during scraping.
To avoid this, test your setup carefully:
- Start by blocking non-essential resources like images, videos, and ads.
- Gradually experiment with blocking more resources like CSS or JavaScript, ensuring the website still loads the content you need.
# First, I will try standard JavaScript
I already knew I could hide elements by applying a CSS class with JavaScript. For example, with a single line of code like this, I can hide the header of a website:
```javascript
document.getElementById('header-content').style.display = 'none';
```
First, I tried using this simple JavaScript snippet to hide images, videos, and other elements. A coder would have told me straight away that this would not work, and they would be right. There was no difference in page load speed. Why? Because the code was applied after the page loaded, not before.
To speed up scraping, I need to intercept resources before loading them. I initially hoped a line of JavaScript could achieve this, but it cannot. After asking ChatGPT some questions, it was clear I would need an additional JavaScript library to do what I wanted.
Thankfully, all is not lost; Axiom.ai includes Puppeteer as a dependency, with which I can now do some neat tricks.
# Why we will use Puppeteer
We will use Puppeteer because Axiom.ai already has the library installed. This lets us call Puppeteer functions directly in the JavaScript step without manually loading it, making it easier to get started.
Before writing this blog, I knew Puppeteer could intercept page requests, but I didn’t know how. I had never written any Puppeteer code before.
No problem! A quick chat with ChatGPT introduced me to the setRequestInterception method. This Puppeteer method intercepts network requests, giving us control over how resources like images, videos, and scripts load.
With this method, I can block unnecessary resources with just a few lines of code before they load. That’s exactly what I need to speed up my scraping workflow!
# How to get started and replicate these experiments
I tested all my Puppeteer scripts by quickly building a simple bot to check if they worked. When blocking images, I could see the site load without them when I clicked Run. To keep things simple, I judged success by observing whether the page loaded faster. If I saw a noticeable difference, I knew it was worth implementing.
# Steps to test your setup:
- From the dashboard, click "New Automation".
- Use the step finder to add the "JavaScript" step. Since we're using Puppeteer functions, set this step to "Run in App".
- Add the "Go to Page" step and enter a URL.
- Finally, add a "Wait" step with a 5-second delay to prevent the bot from shutting down too quickly.
You can apply the JavaScript step using the code below to any of your web scrapers built in Axiom.ai.
# Brief explanation of what the code does
Let’s break down the code block below to understand each part. You’ll notice a clear pattern, and I’ll walk you through its purpose step by step.
# Code explanation
This line calls the Puppeteer function setRequestInterception and prepares the bot to intercept page requests.
```javascript
await page.setRequestInterception(true);
```
Next, this line listens for page requests as the page loads. Puppeteer triggers the function for each request the webpage makes.
```javascript
page.on('request', request => {
```
Here, we determine which requests to block and which to allow. In this example, we block all image requests by checking the resourceType. If the request is an image, we call request.abort() to block it. Otherwise, we call request.continue() to let it proceed.
```javascript
if (request.resourceType() === 'image') {
  request.abort();
} else {
  request.continue();
}
```
# What this code does overall
- Intercepts every network request as the page loads.
- Checks the type of each requested resource (e.g., images, scripts, CSS).
- Blocks images to reduce bandwidth usage and speed up page loading.
- Allows all other resource types to load as usual.
# How to intercept and block images from loading
First, I experimented with blocking images to see if it made a noticeable difference in load speed.
I tested a simple block of code that prevents all images from loading. I inserted the code into my JavaScript step and clicked Run. When the page loaded, no images appeared.
```javascript
await page.setRequestInterception(true);

// Block all image requests
page.on('request', request => {
  if (request.resourceType() === 'image') {
    request.abort();
  } else {
    request.continue();
  }
});
```
This simple test confirmed that blocking images significantly reduces the page's visual load. While I didn’t measure exact speed improvements, the difference was immediately noticeable.
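If you want to go beyond eyeballing it, you could time a run with and without blocking. Here's a minimal sketch of a timing helper; `timeIt` is a name I made up, and the `setTimeout` delays below just stand in for real page loads (in Axiom.ai you would wrap the navigation or scrape you want to measure):

```javascript
// Minimal timing helper: run an async action and report the elapsed time.
async function timeIt(label, action) {
  const start = Date.now();
  await action();
  const elapsed = Date.now() - start;
  console.log(`${label}: ${elapsed} ms`);
  return elapsed;
}

// Example: compare two runs (the delays stand in for real page loads).
timeIt('with blocking', () => new Promise(resolve => setTimeout(resolve, 50)))
  .then(() => timeIt('without blocking', () => new Promise(resolve => setTimeout(resolve, 150))));
```

Comparing the two numbers over a few runs gives you a rough but honest answer to "was blocking worth it?" for a given site.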
# Block all images from multiple domains
I refined my code to allow blocking images by domain. You can copy and paste this code into a JavaScript step, replacing "Insert domain" with your desired domain.
```javascript
// Enable request interception
await page.setRequestInterception(true);

// Define the domains from which to block all image requests
const blockedImageDomains = [
  'jlr.scene7.com',
  'anotherdomain.com',
  'yetanotherdomain.com',
  // Add more domains as needed
];

// Block all image requests from the specified domains
page.on('request', request => {
  const requestUrl = request.url();
  const urlHostname = new URL(requestUrl).hostname;
  if (
    request.resourceType() === 'image' &&
    blockedImageDomains.includes(urlHostname)
  ) {
    request.abort(); // Block the image request
  } else {
    request.continue(); // Allow all other requests
  }
});
```
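One caveat worth knowing: an exact `includes(urlHostname)` check only matches the hostname as written, so images served from a subdomain like `cdn.anotherdomain.com` would slip through a blocklist containing `anotherdomain.com`. If you want subdomains covered too, here's a sketch of a stricter matcher; `isBlockedHost` is a helper name I've invented for illustration:

```javascript
// Match a hostname against a blocklist, including subdomains.
// 'cdn.example.com' matches a blocked 'example.com', but
// 'notexample.com' does not.
function isBlockedHost(hostname, blockedDomains) {
  return blockedDomains.some(
    domain => hostname === domain || hostname.endsWith('.' + domain)
  );
}
```

In the script above, you would swap `blockedImageDomains.includes(urlHostname)` for `isBlockedHost(urlHostname, blockedImageDomains)`.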
# How to block all videos
Websites increasingly feature video content. While streaming services and faster broadband have improved load speeds, videos can still slow down page loading.
To block video content, I modified my code by changing the resource type from image to media. Here’s the updated snippet:
```javascript
await page.setRequestInterception(true);

// Block all video (media) requests
page.on('request', request => {
  if (request.resourceType() === 'media') {
    request.abort(); // Block video/audio requests
  } else {
    request.continue(); // Allow all other requests
  }
});
```
This adjustment prevents videos from loading, which can significantly reduce a webpage’s load time. The change is simple and can be easily applied in a JavaScript step.
# How to block CSS stylesheets
Next, I experimented with blocking CSS to see if it could speed up load times. Below, I’m sharing two scripts: one blocks all CSS, and the other targets specific stylesheets by URL.
```javascript
await page.setRequestInterception(true);

// Block all CSS (stylesheet) requests
page.on('request', request => {
  if (request.resourceType() === 'stylesheet') {
    request.abort(); // Block CSS requests
  } else {
    request.continue(); // Allow all other requests
  }
});
```
Blocking CSS can sometimes improve load speed, especially when stylesheets are large or externally hosted. However, be cautious: blocking essential CSS may break the webpage layout, making it harder to scrape structured content. I recommend blocking only external stylesheets unrelated to the main web application you’re scraping. Use the code below to do this.
# How to block CSS stylesheets by URL
```javascript
await page.setRequestInterception(true);

// Define the URLs of stylesheets to block
const blockedStylesheetUrls = [
  'https://example.com/styles/blocked-style.css',
  'https://anotherdomain.com/css/unwanted.css',
  // Add more stylesheet URLs as needed
];

// Block specific stylesheet requests
page.on('request', request => {
  if (
    request.resourceType() === 'stylesheet' &&
    blockedStylesheetUrls.includes(request.url())
  ) {
    request.abort(); // Block the stylesheet
  } else {
    request.continue(); // Allow all other requests
  }
});
```
# My final experiment: blocking third-party apps and JavaScript files
For my final experiment, I blocked third-party apps and JavaScript files. These resources often track data or power ads, and they can significantly slow down page load times.
This script blocks requests to third-party domains by filtering the URLs of the source files.
```javascript
// Enable request interception
await page.setRequestInterception(true);

// Define the URLs of JavaScript files to block
const blockedScriptUrls = [
  'https://example.com/scripts/unwanted-script.js',
  'https://anotherdomain.com/js/tracking.js',
  // Add more script URLs as needed
];

// Block specific JavaScript requests
page.on('request', request => {
  if (
    request.resourceType() === 'script' &&
    blockedScriptUrls.includes(request.url())
  ) {
    request.abort(); // Block the JavaScript file
  } else {
    request.continue(); // Allow all other requests
  }
});
```
# A combined script for blocking images, videos, and URLs
Here’s a script that combines all the functionality from the previous examples. With this single script, you can efficiently block images, videos, and specific URLs.
If you’re not a developer, don’t worry—it’s simple to use:
- Copy the code into a JavaScript step and set it to Run in App.
- To block additional domains, add them to the blockedScriptDomains array.
- By default, the script blocks images and media.
```javascript
// Enable request interception
await page.setRequestInterception(true);

// Define third-party script domains to block
const blockedScriptDomains = new Set([
  'thirdparty.com',
  'anotherdomain.net',
  'yetanotherapp.org',
  // Add more domains as needed
]);

page.on('request', request => {
  const requestUrl = request.url().toLowerCase();
  const resourceType = request.resourceType();
  let hostname = '';
  try {
    hostname = new URL(requestUrl).hostname;
  } catch {
    // If URL parsing fails, proceed without blocking
  }
  if (
    resourceType === 'image' ||
    resourceType === 'media' ||
    (resourceType === 'script' && blockedScriptDomains.has(hostname))
  ) {
    request.abort();
  } else {
    request.continue();
  }
});
```
# Wrapping up – is writing a few lines of code worth it?
To intercept requests and block resources, we use Puppeteer because it includes a built-in function for request interception. Fortunately, Puppeteer is part of Axiom.ai, making the process simple and accessible.
When blocking resources, it’s important to consider the potential impact. Blocking certain elements might break the website you’re trying to scrape. However, blocking images, videos, third-party apps, and fonts is usually safe and can significantly speed up web scraping.
I should also mention ChatGPT. Without it, learning this would have been harder and taken much longer. ChatGPT truly makes coding accessible to no-coders. It’s especially helpful in breaking down and explaining what each part of the code does. We also integrate ChatGPT into Axiom.ai.
Final thoughts:
- Can a no-coder use Puppeteer for this? Yes! Even without coding experience, you can use simple Puppeteer scripts to enhance your scraping workflow.
- Have I improved my skills? Absolutely, this process has been a great learning experience.
- Can any no-coder do this without ChatGPT? Yes, it’s surprisingly simple when broken down step by step.
One question I’m still pondering: What happens if the site loads too quickly? Could that trigger anti-bot measures? Web scraping isn’t always straightforward, so I’ll need to consider this when building my next scraper. But I’m definitely using this new method I’ve learned!
FYI, in a future release of the Axiom.ai no-code builder, we're adding a settings feature to block resources.