How to extract data from HTML with an AI
Learn how simple it is to create a web scraper that loops through URLs in a Google Sheet, scrapes the HTML of each page, and then uses ChatGPT to extract data. It combines just a few steps and removes the need for CSS selectors. Get started quickly with this ChatGPT web scraper template. Visit these pages to learn more about getting started with extracting data and using our builder. A ChatGPT subscription is required to run this bot.
Design pattern: AI web scraper
# Start from blank, adding the following steps
In the Axiom.ai Chrome extension dashboard, click "New Automation" and then select "Add first step". Use the step finder to add the steps outlined below.
# Prepare your Google Sheet
Separate your URLs row by row in the same column.
| Col A | Col B |
|---|---|
| Insert your URLs like this | --- |
| Insert your URLs like this | --- |
| Insert your URLs like this | --- |
# Add a ‘Read data from a Google Sheet’ step
First we want to fetch the URLs from the Google Sheet.
- Spreadsheet - Search for the Google Sheet you created in the "Spreadsheet" field. Once found, click to select.
- Sheet name - Choose a sheet tab or leave blank to use the first tab.
- First cell - Start from a specified column and row, for example, "A1".
- Last cell - End at a specified column and row, for example, "AB1".
Tip
💡 To read a single row of data, set both a First cell and a Last cell. This is useful for a quick test run.
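To make the First cell and Last cell settings concrete, here is a minimal Python sketch of how an A1-style range picks out a block of cells. It is only an illustration: the sheet is modelled as a plain list of rows, and `parse_cell` and `read_range` are hypothetical helper names, not part of Axiom.ai or the Google Sheets API.

```python
from __future__ import annotations

import re


def parse_cell(ref: str) -> tuple[int, int]:
    """Convert an A1-style reference such as 'AB1' into zero-based (row, column)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9]\d*)", ref.strip())
    if match is None:
        raise ValueError(f"not an A1-style cell reference: {ref!r}")
    letters, digits = match.groups()
    column = 0
    for letter in letters.upper():
        # Column letters work like base-26 digits: A=1 ... Z=26, AA=27, AB=28.
        column = column * 26 + (ord(letter) - ord("A") + 1)
    return int(digits) - 1, column - 1


def read_range(grid: list[list[str]], first_cell: str, last_cell: str) -> list[list[str]]:
    """Return the rows and columns of `grid` between the two cells, inclusive."""
    first_row, first_col = parse_cell(first_cell)
    last_row, last_col = parse_cell(last_cell)
    return [row[first_col:last_col + 1] for row in grid[first_row:last_row + 1]]


# Hypothetical sheet contents: one URL per row in column A.
sheet = [
    ["https://example.com/a", ""],
    ["https://example.com/b", ""],
    ["https://example.com/c", ""],
]

# First cell "A1" and Last cell "A1" read a single cell, handy for a test run.
single = read_range(sheet, "A1", "A1")
```

Widening the Last cell to "A3" in this sketch would return all three URLs, one per row.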
# Add a ‘Loop through data’ step
We use this step to loop the web scraper. Select the data you wish to loop through, which in this case is the URLs. Learn more about looping.
- Loop through data - Click ‘Insert data’ and select ‘google-sheet-data’.
# Add a ‘Go to page’ sub-step (Insert in the loop)
Next we load each URL in a Chrome browser so the page is ready for data extraction. This step is inserted inside the loop!
- Enter URL - Click ‘Insert data’ and select ‘google-sheet-data’.
# Add a "Get data from bot's current page" sub-step
Now select the HTML to extract.
- Select - Click select and choose the outermost element.
- Select HTML - Once selected, change the data type to "select HTML", see how.
Tip
You could always use a custom selector such as "body" or "html" to grab all the HTML with the selector tool.
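As a rough illustration of what a "body" selector with the HTML data type hands to the next step, the following Python sketch collects the markup between `<body>` and `</body>` from a page source. It uses only the standard library. `BodyHTML` and `body_html` are hypothetical names, and this is not how the Axiom.ai selector tool is implemented.

```python
from html.parser import HTMLParser


class BodyHTML(HTMLParser):
    """Collects the markup found between <body> and </body> (comments are dropped)."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, without a separate end tag.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append(f"&#{name};")


def body_html(page: str) -> str:
    """Return the HTML inside the page's <body> element."""
    parser = BodyHTML()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts)


# Hypothetical page source used only for this example.
page = (
    "<html><head><title>Team</title></head>"
    '<body><div class="card"><h1>Ada</h1><p>ada@example.com</p></div></body></html>'
)
scraped = body_html(page)
```

Everything in the head, such as the title, is left out, while the whole body is passed along regardless of its structure.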
# Add an "Extract data with ChatGPT" step
Now we extract the data from the HTML using an AI. You may need to experiment with the values you ask it to extract.
- ChatGPT API key - Enter your API key.
- Data - Insert "[scrape-data]".
- Extract values - Insert the values you want to extract, separated by commas, for example "name, email, job title".
Go here to learn more about our ChatGPT extract step.
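To show the general idea behind AI extraction, the sketch below turns a comma-separated list of values into a prompt and converts a JSON reply into a row for a spreadsheet. The prompt wording, the JSON reply format, and the function names `split_fields`, `build_prompt` and `parse_reply` are all assumptions made for illustration. The prompt the Axiom.ai step actually sends is not documented here, and no API call is made in this example.

```python
from __future__ import annotations

import json


def split_fields(fields: str) -> list[str]:
    """Split 'name, email,job title' into ['name', 'email', 'job title']."""
    return [name.strip() for name in fields.split(",") if name.strip()]


def build_prompt(html: str, fields: str) -> str:
    """Build an extraction prompt (assumed wording) for the given HTML and fields."""
    names = split_fields(fields)
    return (
        "Extract the following values from the HTML below and reply with a single "
        "JSON object using exactly these keys: " + ", ".join(names) + ". "
        "Use null for any value that is not present.\n\nHTML:\n" + html
    )


def parse_reply(reply: str, fields: str) -> list[str]:
    """Turn a JSON reply into one row, with columns in the order the fields were listed."""
    data = json.loads(reply)
    # Missing or null values become empty cells.
    return ["" if data.get(name) is None else str(data[name]) for name in split_fields(fields)]


fields = "name, email,job title"
prompt = build_prompt("<h1>Ada</h1><p>ada@example.com</p>", fields)

# A made-up reply in the assumed JSON format, standing in for a real model response.
reply = '{"name": "Ada", "email": "ada@example.com", "job title": null}'
row = parse_reply(reply, fields)
```

Because the values are described in plain language rather than located with selectors, changing what is extracted only means changing the list of values.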
# Add a "Write data to a Google Sheet" sub-step
Next we output our scraped data to a Google Sheet.
- Spreadsheet - Search for and select a Google Sheet to write data to, or paste its URL here.
- Sheet name - Choose a sheet tab or leave it blank to use the first tab.
- Data - Select the data to input into the Google sheet by clicking "Insert data", then choose "scraped-link-data".
- Write options - Select "Add to existing data" to write the new data to the sheet without first deleting existing data.
# Add a ‘Delete rows from a Google Sheet’ sub-step
Finally, we delete the row that has just been processed in the loop, so the scraper does not repeat the same scrape on the next loop.
- Spreadsheet - Search for the Google Sheet you created in the "Spreadsheet" field. Once found, click to select.
- Sheet name - Choose a sheet tab or leave blank to use the first tab.
- First row to delete - Leave set to 1.
- Last row to delete - Leave set to 1.
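The reason for always deleting row 1 can be sketched in a few lines of Python. Here the URL sheet and the output sheet are modelled as plain lists, and `process_queue` and `fake_scrape` are hypothetical names. This is a simplified model of the pattern, not a description of how Axiom.ai runs the loop internally.

```python
def process_queue(url_rows, scrape, output_rows):
    """Work through the sheet from the top, removing each row once it is done."""
    processed = 0
    while url_rows:
        url = url_rows[0][0]        # always read the URL in row 1, column A
        result = scrape(url)        # may raise if the page cannot be scraped
        output_rows.append(result)  # like "Add to existing data": append, never overwrite
        del url_rows[0]             # like deleting rows 1 to 1: the next URL moves up
        processed += 1
    return processed


def fake_scrape(url):
    """Stand-in for the scrape and extract steps; fails on one URL to show recovery."""
    if url.endswith("/b"):
        raise RuntimeError(f"could not load {url}")
    return [url, "ok"]


urls = [["https://example.com/a"], ["https://example.com/b"], ["https://example.com/c"]]
output = []
try:
    process_queue(urls, fake_scrape, output)
except RuntimeError:
    pass  # the run stopped early; completed rows are already gone from `urls`
```

In this model, a run that stops partway leaves only the unprocessed URLs in the sheet, so a later run picks up where the previous one ended instead of scraping the same pages again.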
# Wrapping up
In just seven steps you can create a simple web scraper that extracts data from a website with ChatGPT and writes it to a Google Sheet. Because the scraper uses AI to extract the data, it does not rely on CSS selectors, which can break when a page's markup changes. That makes this design pattern reusable across websites.