How can CSS attribute selectors be used for web scraping?

Want to scrape data from a web page? Axiom makes it easy with our 'Get data from a webpage' step. Our selector tool lets you choose the content you want to scrape by pointing and clicking.

However, occasionally the selector tool can fail to return the correct results. When this happens, Axiom will display an error message such as this:

Error in step 1 - "Get data from a webpage": Your chosen selectors have failed to find any content on page 'https://www.bbc.co.uk/'. If the page is loading too slowly, try adding a min wait. Use Continue on error if you want your Axiom to continue.
Learn more about common errors

If you experience this issue, you can try using a different selection - but sometimes the problem still persists. In that case, you need a custom selector! One of the most useful custom selectors to know about is the attribute selector.

In this guide, we'll show you how to find attribute selectors on a webpage. We'll provide some examples of when this may be necessary, and teach you how to use custom selectors with Axiom.

So if you follow this guide, you'll be well on your way to becoming a web scraping pro. If you're new at the game, we start with a primer on CSS selectors.

# What is a CSS selector?

A CSS rule, which sets the visual style of HTML elements, has two parts. The selector determines which elements the style applies to, and the declaration is a set of instructions specifying how those elements should look.

Here, an HTML element is given the class “xl-red-font”, which the CSS selector “.xl-red-font” can then target.

<h1 class="xl-red-font">Makes this text large and red</h1>

If we wanted this header text to be shown as a gigantic red font, the entire rule would be as follows:

.xl-red-font {
	color: red;
	font-size: 1090px;
}

The CSS selector is the “.xl-red-font” part: a dot followed by the class name that appears in the HTML. It specifies which element should be selected - hence its name. The declaration lives within the curly braces and sets the visual style. In this case, it’s big and red!

# What is a CSS attribute selector?

These selectors allow you to choose elements based on their HTML attributes. Much like a CSS rule, HTML is split into component parts. In this case we can think of HTML as consisting of elements and attributes. The element is the type of HTML entity to render, such as a link (“a”), a button (“button”) or a header (“h1”, “h2” etc.). The attributes are the name="value" pairs that appear inside the element’s opening tag.

The following example HTML specifies an “input” element with a number of attributes - “class”, “autocapitalize”, “autocomplete” etc. are all HTML attributes.

<input class="" autocapitalize="none" autocomplete="off" autocorrect="off" id="react-select-choose-a-tone-input" spellcheck="false" tabindex="0" type="text" aria-autocomplete="list" aria-expanded="false" aria-haspopup="true" aria-label="choose-a-tone" aria-labelledby="choose-a-tone" role="combobox" value="" style="color: inherit; background: 0px center; opacity: 1; width: 100%; grid-area: 1 / 2 / auto / auto; font: inherit; min-width: 2px; border: 0px; margin: 0px; outline: 0px; padding: 0px;">

In order to create an attribute selector, simply take the attribute as it appears in the HTML and wrap it in square brackets. That’s it!

[aria-labelledby="choose-a-tone"]

When choosing an attribute to select on, make sure you choose one that uniquely specifies the element you want - otherwise you might end up with the wrong element. In the above example, the “aria-labelledby” attribute is a unique selector that is only present on this element.
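To make the idea concrete, here is a minimal sketch of what an attribute selector checks, written with only Python's standard library. It is not how Axiom or a browser implements selectors; the helper names (`AttributeFinder`, `find_by_attribute`) and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Collects start tags whose attribute `name` equals `value` exactly."""

    def __init__(self, name, value):
        super().__init__()
        self.name = name
        self.value = value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs for this tag
        if dict(attrs).get(self.name) == self.value:
            self.matches.append(tag)


def find_by_attribute(html, name, value):
    """Rough stand-in for the selector [name="value"]."""
    finder = AttributeFinder(name, value)
    finder.feed(html)
    return finder.matches


# Hypothetical, simplified page fragment
sample = (
    '<label id="choose-a-tone">Tone</label>'
    '<input type="text" role="combobox" aria-labelledby="choose-a-tone">'
    '<input type="text" role="combobox" aria-labelledby="choose-a-length">'
)

# Unique attribute value: exactly one element matches
print(find_by_attribute(sample, "aria-labelledby", "choose-a-tone"))  # ['input']
# Shared attribute value: two elements match, so this is a poor choice
print(find_by_attribute(sample, "role", "combobox"))  # ['input', 'input']
```

The second call shows why uniqueness matters: an attribute shared by several elements returns all of them.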

Identifying the right attribute can be a bit of an art, but when they contain human-readable text (as above) it’s a lot easier.

Because attribute values often contain human-readable text, they can be very useful! For example, Google Maps includes zip codes in the attributes of its HTML, which lets you find locations there very conveniently.

# Are there other kinds of CSS selectors?

There are many other types of CSS selectors, including the ones listed below. Often when constructing a selector, you will use one or more of these selectors in combination with an Attribute selector.

# The CSS element Selector

We’ve already met elements, and the element selector allows you to target all elements which share a tag name.

Examples of HTML elements:

<h1>,<span>,<div>,<strong>,<a>,<input>,<button>

In order to specify an element selector, just use the name of the element tag (without any attributes or angle brackets).

For example, the above elements can be selected like this (the commas make the selector match any element in the list):

h1,span,div,strong,a,input,button

# The CSS id Selector

The id attribute gives an HTML element a name that can be used to reference the element from other parts of the document, or from external documents. The name must be unique within the HTML document.

Here’s some example HTML with an id specified:

<h1 id="page-title"> Browser automation is so cool</h1>

The id attribute has a convenient shorthand. Instead of typing [id="page-title"], you can use a #, like this:

#page-title

# The CSS class Selector

The class attribute tags an HTML element with a reusable name that can be used to identify it and similar elements. Class names are usually not unique within a document, but specify a set of related elements.

Here’s some HTML with a class specified:

<h1 class="page-title"> Browser automation is so cool</h1>

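As with the id attribute, there is a shorthand: instead of the [class="page-title"] syntax you can use a dot. The dot form is also more forgiving, because it matches any element whose class list includes “page-title”, whereas [class="page-title"] only matches when that is the entire class value: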

.page-title
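The difference between the dot shorthand and the bracket form is easy to miss on elements that carry several classes. The sketch below mimics the two matching rules in plain Python; the function names are hypothetical and only illustrate the comparison each selector performs.

```python
def matches_class_selector(class_attr, wanted):
    """Mimics .wanted: true when `wanted` is one of the space-separated class names."""
    return wanted in class_attr.split()


def matches_exact_attribute(class_attr, wanted):
    """Mimics [class="wanted"]: true only when the whole attribute value is equal."""
    return class_attr == wanted


single = "page-title"
several = "page-title xl-red-font"

print(matches_class_selector(single, "page-title"))    # True
print(matches_exact_attribute(single, "page-title"))   # True
print(matches_class_selector(several, "page-title"))   # True
print(matches_exact_attribute(several, "page-title"))  # False
```

On pages where elements have long class lists, the dot form is therefore usually the safer choice.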

To learn more about this topic, you can check out the following resources:

https://www.w3schools.com/css/css_selectors.asp

https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors

http://web.simmons.edu/~grabiner/comm244/weekfour/selectors

# Attribute selectors can be combined with other CSS selectors

It can be very handy to combine attribute selectors with other kinds of selectors. Doing this allows you to create more precise rules that target just that one element you want from a page, without pulling in any others.

# LinkedIn pager button

Here is the HTML from a LinkedIn pager button:

<button aria-label="Next" id="ember1951" class="artdeco-pagination__button artdeco-pagination__button--next artdeco-button artdeco-button--muted artdeco-button--icon-right artdeco-button--1 artdeco-button--tertiary ember-view" type="button">  <li-icon aria-hidden="true" type="chevron-right-icon" class="artdeco-button__icon" size="small"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" class="mercado-match" data-supported-dps="16x16" fill="currentColor" width="16" height="16" focusable="false">
  <path d="M5 15l4.61-7L5 1h2.39L12 8l-4.61 7z"></path>
</svg></li-icon>

<span class="artdeco-button__text">
    Next
</span></button>

We could extract the following attribute ‘aria-label="Next"’ to make this selector. The aria-label attribute seems to be human readable text, so it looks like a great choice for identifying that “Next” button.

We can then combine this with other selectors to make sure we have exactly the element we want. Here are a couple of examples of this in action:

button[aria-label="Next"]

.artdeco-pagination__button[aria-label="Next"]
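To see why combining selectors narrows the result, the following sketch applies the same conditions to a small, made-up page fragment using Python's standard library. The `select` helper and the extra `<a aria-label="Next">` link are assumptions added for illustration; they are not taken from LinkedIn's real markup.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def select(html, tag=None, class_name=None, attr=None):
    """Return tags that satisfy every condition given, like a compound selector."""
    collector = TagCollector()
    collector.feed(html)
    results = []
    for name, attrs in collector.tags:
        if tag is not None and name != tag:
            continue
        if class_name is not None and class_name not in (attrs.get("class") or "").split():
            continue
        if attr is not None and attrs.get(attr[0]) != attr[1]:
            continue
        results.append(name)
    return results


# Hypothetical fragment: two pager buttons plus an unrelated "Next" link
sample = (
    '<button aria-label="Previous" class="artdeco-pagination__button">Previous</button>'
    '<button aria-label="Next" class="artdeco-pagination__button '
    'artdeco-pagination__button--next">Next</button>'
    '<a aria-label="Next" href="/help">Next</a>'
)

# [aria-label="Next"] alone also picks up the link
print(select(sample, attr=("aria-label", "Next")))  # ['button', 'a']
# button[aria-label="Next"]
print(select(sample, tag="button", attr=("aria-label", "Next")))  # ['button']
# .artdeco-pagination__button[aria-label="Next"]
print(select(sample, class_name="artdeco-pagination__button", attr=("aria-label", "Next")))  # ['button']
```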

# Facebook group

Below is the outer wrapper div from an article from a Facebook Group page.

<div aria-describedby="jsc_c_jb jsc_c_jc jsc_c_jd jsc_c_jf jsc_c_je" 
aria-labelledby="jsc_c_ja" class="lzcic4wl" role="article">

We could use ‘role="article"’ to target elements in the Article. The selector below would extract the name of the author. In this example we have combined four selectors: an element, an attribute, and then two more elements.

div[role="article"] a strong

Here, the spaces indicate that each element is nested somewhere inside the one before it (a descendant, not necessarily a direct child). This means that the selector will look for a “strong” element, inside an “a” element, inside a “div” element with the attribute role="article".
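The nesting order can be checked with a small sketch that tracks which elements are currently open while reading the HTML. It uses only Python's standard library, the sample markup and names (`AuthorFinder`, `find_authors`) are invented for illustration, and it ignores details a real CSS engine handles, such as void elements like `<img>`.

```python
from html.parser import HTMLParser


class AuthorFinder(HTMLParser):
    """Collects text matched by: div[role="article"] a strong"""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements, outermost first
        self.authors = []

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Close the most recent open element with this tag name
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def matches_selector(self):
        # The innermost open element must be the last part of the selector
        if not self.stack or self.stack[-1][0] != "strong":
            return False
        steps = [
            lambda tag, attrs: tag == "div" and attrs.get("role") == "article",
            lambda tag, attrs: tag == "a",
        ]
        # Each step must be found, in order, among the ancestors
        ancestors = iter(self.stack[:-1])
        return all(any(step(tag, attrs) for tag, attrs in ancestors) for step in steps)

    def handle_data(self, data):
        if data.strip() and self.matches_selector():
            self.authors.append(data.strip())


def find_authors(html):
    finder = AuthorFinder()
    finder.feed(html)
    return finder.authors


# Hypothetical, heavily simplified group page
sample = (
    '<div role="article"><h4><a href="/u/1"><strong>Ada Lovelace</strong></a></h4>'
    '<p>Post with <strong>bold text</strong></p></div>'
    '<div role="banner"><a href="/"><strong>Home</strong></a></div>'
)

print(find_authors(sample))  # ['Ada Lovelace']
```

The bold text in the post body is skipped because it has no “a” ancestor, and the banner link is skipped because its div does not have role="article".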

# How do I find selectors on a web page?

Google Chrome has a built-in tool for inspecting the webpage that makes this easy. Web developers like to show off by making it seem like magic, but it's actually pretty simple to learn! Here's a short video that shows you how it works.

💡 To search for anything on a web page, right-click on an element and select Inspect. This will open Developer Tools. Then use Ctrl+F or Command+F to search for anything within the source code of the page.

Learn more about Chrome Developer tools and the web inspector.

# How do you define your own CSS selectors?

Here are three methods for generating custom selectors. Ultimately, you will want to master the art of combining these methods.

# 1. Identify a selector unique to the element you want

Look for selectors that are unique to the element. If it has a unique attribute, use that.

For example, if you wanted to click the Tweet button on my Twitter feed, here's the HTML:

<div role="button" class="css-18t94o4 css-1dbjc4n r-l5o3uw r-42olwf r-sdzlij r-1phboty r-rs99b7 r-19u6a5r r-2yi16 r-1qi8awa r-1ny4l3l r-ymttw5 r-o7ynqc r-6416eg r-lrvibr" 
data-testid="tweetButtonInline" tabindex="0"><div dir="auto" class="css-901oao r-1awozwy r-jwli3a r-6koalj r-18u37iz r-16y2uox r-37j5jr r-a023e6 r-b88u0q r-1777fci r-rjixqe r-bcqeeo r-q4m81j r-qvutc0">
<span class="css-901oao css-16my406 css-bfa6kz r-poiln3 r-a023e6 r-rjixqe r-bcqeeo r-qvutc0"><span class="css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0">Tweet</span></span></div></div>

We would use the div element with an attribute selector.

div[data-testid="tweetButtonInline"]

Sometimes, websites strip out unique attributes to make themselves more difficult to scrape. If this is the case for your target site, you’ll need to use more devious methods...

# 2. Find a unique grouping selector

All HTML is ordered in a hierarchy. This means you can often find the element you want by finding a parent, i.e. an element which wraps around and contains your desired target.

For example, if you wanted to scrape the related trends from a Google Trends search, you would see that the related trends are grouped in a block of HTML that defines a widget. First, you find a unique selector for that widget:

<widget type="fe_related_queries" version="1" template="fe" on-event="onEvent({'event': event})" embed="embed" share="share" export="export" explore-query="exploreQuery" fields="[
        {'name': 'topic', 'value': title},
        {'name': 'color', 'value': color}
    ]" apis="[
        {
          'name': 'fe_relatedsearches',
          'url': config.pathPrefix + '/api/widgetdata/relatedsearches',
          'params': {'req': request, 'token': token}
        }
    ]" story-title="storyTitle" story-country="storyCountry" story-time-range="storyTimeRange" show-mode-picker="true" palette="palette" forced-color="" help-dialog="helpDialog" widget-name="widgetName" ve-tracking="" jslog="39387; track:impression"><!---->

In this case, you can use the element ‘widget’ and the following attribute selector:

widget[type="fe_related_queries"]

Then to scrape the topics we can add an “a” element selector, separated by a space, to find the “a” element within the widget HTML.

widget[type="fe_related_queries"] a

Still, it can be the case that it’s hard to even find a grouping element with a unique selector. So what do we do then?

# 3. Look for the unique positioning of your content within the page

Webpages are made up of different elements, all of which are contained by other elements. Some elements are parents and others children; some elements can be both. Every element occupies exactly one position within this hierarchy, and this can be specified uniquely with a selector - which means if you can figure out that selector, you can pinpoint your target. Bullseye! 🎯

In this example we want to work out a custom selector to click the like button of the first Instagram post on my feed. Here's the HTML from that article:

<article class=" _ab6k _ab6l _ab6m _aatb _aatc _aate _aatf _aath _aati" role="presentation" tabindex="-1"><div class="_ab8w  _ab94 _ab99 _ab9h _ab9m _ab9p _abcm" style="max-height: inherit; max-width: inherit;"><div class="_aasi _aasj"><div class="_ab8w  _ab94 _ab97 _ab9i _ab9k _ab9p _abcm"><header class="_aaqw _aaqx"><div class="_aap6 _aap7 _aapa"><div class="_aarf _aarg _aaqq" aria-disabled="false" role="button" tabindex="0" style="cursor: pointer;"><canvas class="_aarh" height="84" width="84" style="position: absolute; top: -5px; left: -5px; width: 42px; height: 42px;"></canvas><span class="_aa8h" role="link" tabindex="-1" style="width: 32px; height: 32px;"><img alt="zoeandbrodsforeheadkisses's profile picture" class="_aa8j" crossorigin="anonymous" draggable="false"></span></div></div><div class="_aaqy _aaq-"><div class=" _aar1"><div class="_aaqt"><div class="_ab8w  _ab94 _ab97 _ab9f _ab9k _ab9p _abcm"><div class="_aacl _aaco _aacw _aacx _aad6 _aade"><span class="_aap6 _aap7 _aap8"><a class="oajrlxb2 g5ia77u1 qu0x051f esr5mh6w e9989ue4 r7d6kgcz rq0escxv nhd2j8a9 nc684nl6 p7hjln8o kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x jb3vyjys rz4wbd8a qt6c0cv9 a8nywdso i1ao9s8h esuyzwwr f1sip0of lzcic4wl _acan _acao _acat _acaw _a6hd" href="/zoeandbrodsforeheadkisses/" role="link" tabindex="0">zoeandbrodsforeheadkisses</a></span></div></div></div></div><div class="_aaql"><div class="_aacl _aacn _aacu _aacx _aad6 _aade"><div></div><div class="_aaqm"><div class="_aacl _aacn _aacu _aacy _aada _aade"><a class="oajrlxb2 g5ia77u1 qu0x051f esr5mh6w e9989ue4 r7d6kgcz rq0escxv nhd2j8a9 nc684nl6 p7hjln8o kvgmc6g5 cxmmr5t8 oygrvhab hcukyx3x jb3vyjys rz4wbd8a qt6c0cv9 a8nywdso i1ao9s8h esuyzwwr f1sip0of lzcic4wl _aaqk _a6hd" href="/explore/locations/235395557/thredbo-resort/" role="link" tabindex="0">Thredbo Resort</a></div></div></div></div></div></header><div class="_aasm _aasn"><button class="_abl-" type="button"><div class="_abm0"><div class="_ab8w  _ab94 _ab97 _ab9h _ab9m _ab9p _abcm" 
style="height: 24px; width: 24px;"></div></div></button></div></div></div><div class="_aatk"><div class="_aamm"><div class="_aamn"><div class="_aami" style="padding-bottom: 125%;"></div><div class="_ab8w  _ab94 _ab99 _ab9f _ab9m _ab9p _abcf _abcg _abch _abck _abcl _abcm"><div class="_aao_"><div class="_aap0" role="presentation"><div class="_aap1"><ul class="_acay"><li style="transform: translateX(3289px); width: 1px;"></li><li class="_acaz" tabindex="-1" style="transform: translateX(0px);"><div class="_ab8w  _ab94 _ab99 _ab9f _ab9m _ab9p _abcm" style="width: 470px;"><div role="button" class="_aa06" tabindex="0"><div><div class="_aagu _aamh"><div class="_aagv" style="padding-bottom: 125%;"><img alt="Photo by Zoe Sandell on July 21, 2022. May be an image of 7 people, people standing, nature and people skiing." class="_aagt" crossorigin="anonymous" decoding="auto" sizes="470px" </div><div class="_aagw"></div></div></div></div></div></li><li class="_acaz" tabindex="-1" style="transform: translateX(470px);"><div class="_ab8w  _ab94 _ab99 _ab9f _ab9m _ab9p _abcm" style="width: 470px;"><div role="button" class="_aa06" tabindex="0"><div><div class="_aagu _aamh"><div class="_aagv" style="padding-bottom: 125%;">

The first article and button haven't been given any unique selectors, but the positioning of those elements must be unique. We want to click the first button in the first article.

article:nth-child(1) section span:nth-child(1) button

Above, I target the ‘article’ by its position by adding ‘:nth-child(1)’. You can target a different position by changing the number, so article:nth-child(2) selects an article that is the second child of its parent, and so on. Note that the count includes every sibling element, not only articles. Repeat the trick to find and click the correct button.
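The counting rule behind nth-child is worth spelling out, because it explains why positional selectors break when a page's layout shifts. This is a minimal sketch in plain Python; `nth_child` is a hypothetical helper that models one parent's children as a list of tag names.

```python
def nth_child(siblings, tag, n):
    """Return the index in `siblings` matched by tag:nth-child(n), or None.

    `siblings` lists the tag names of one parent's child elements in order.
    CSS counts positions from 1 and counts every sibling, whatever its tag.
    """
    if 1 <= n <= len(siblings) and siblings[n - 1] == tag:
        return n - 1
    return None


feed = ["article", "article", "article"]
print(nth_child(feed, "article", 1))  # 0
print(nth_child(feed, "article", 2))  # 1

# If the site inserts a banner div first, article:nth-child(1) stops matching
shifted = ["div", "article", "article", "article"]
print(nth_child(shifted, "article", 1))  # None
print(nth_child(shifted, "article", 2))  # 1
```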

Downside: elements can change in their relative positions on the page when data is added or removed. For example, property websites tend to have lots of changing data on a page-by-page basis. This can cause the absolute precision of your hierarchical selectors to work against you!

You can learn about children here.

# What Axiom steps can custom selectors be used with?

The following steps in Axiom feature our no-code selector tool and can be used in conjunction with custom CSS selectors:

  • Scrape data from a Webpage
  • Scrape links from a webpage
  • Click Element or Click Elements
  • Enter text
  • Select list
  • Download file/files
  • Upload file/files

You can learn more about Axiom’s Steps here.

# How do I use custom selectors with Axiom?

Axiom provides a no-code interface for adding custom selectors.

Click on the button ‘<> set custom selector’.

[Image: the custom selector button in Axiom.ai]

Delete any code you see inside the box.

[Image: removing the CSS code from the box in Axiom.ai]

Add your custom selector and save.

[Image: adding an attribute selector in Axiom.ai]

Please note that ‘Get data from a webpage’ can handle multiple selectors. To use more than one, separate them with commas. Selectors entered this way only return text results.

For more control over custom selectors, you will need to edit the selector data model (we’ll get to that in a second).

[Image: adding multiple custom selectors]

# Setting advanced custom selectors using Axiom

If you're really interested in getting into the nitty-gritty with Axiom, you can edit the selector data model manually to tweak the selectors. However, this does require a degree of knowledge. The good news is that we are working on a no-code interface that makes it easy 😀

The selector data is specified in JSON, which is a standard format used for storing data on the web. If you’re familiar with JSON, great! If not here’s a quick cheat sheet:

[Image: editing the data object for CSS selectors]

To change a selector, edit the code highlighted in green (the "selector" value).

To change the result type, change the value in pink (the "resultType" value) to one of these: “link”, “textContent” or “innerHTML”.

Add new selectors by adding ‘{"selector":"","selections":[],"selectedElements":[{}],"rejectedElements":[],"resultType":"innerHTML"}’, separated from the existing entries with a comma.
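Because a single misplaced quote or comma breaks JSON, it can help to generate the entries rather than type them by hand. The sketch below builds entries with the same keys as the template above and checks that they serialise to valid JSON. It assumes the entries sit together in a JSON array, which this guide does not show explicitly, and `make_selector_entry` is a hypothetical helper, not part of Axiom.

```python
import json

# Result types named in this guide
ALLOWED_RESULT_TYPES = {"link", "textContent", "innerHTML"}


def make_selector_entry(selector, result_type="innerHTML"):
    """Build one entry with the same keys as the template in this guide."""
    if result_type not in ALLOWED_RESULT_TYPES:
        raise ValueError(f"unknown result type: {result_type}")
    return {
        "selector": selector,
        "selections": [],
        "selectedElements": [{}],
        "rejectedElements": [],
        "resultType": result_type,
    }


# Assumption: the data model holds its entries in an array
entries = [
    make_selector_entry('button[aria-label="Next"]', "textContent"),
    make_selector_entry('widget[type="fe_related_queries"] a', "link"),
]

# json.dumps escapes the double quotes inside attribute selectors
text = json.dumps(entries)
print(text)

# Round-tripping confirms the text is valid JSON
assert json.loads(text) == entries
```

Note how the quotes inside an attribute selector come out as \" in the JSON text. Forgetting that escape when editing by hand is an easy way to introduce a syntax error.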

Be warned: any syntax error can stop the selector from working! However, if you do mess it up you can always switch back to Axiom’s selector tool and re-select some data, which will regenerate the object so you can try again.

# When should you use attribute selectors with Axiom?

# If the bot fails to click the "next" button when scraping a listing page

Let’s say you have created a bot to scrape all the data from a web page using the ‘Get data from a webpage’ step. You have set up the select and pager rows in the step and you see green ticks next to them. The scraper should retrieve all results, but when you run the bot, it fails to move onto the next page.

This is what the step looks like when it’s correctly configured:

[Image: the step configured with select and pager rows]

If you run your automation and the listing doesn't advance to the next page, try adjusting the selection of the pager button. For example, try selecting it several times or try selecting it in a slightly different way each time. If that doesn't work, it’s custom selector time! See this LinkedIn example:

<button aria-label="Next" id="ember1951" class="artdeco-pagination__button artdeco-pagination__button--next artdeco-button artdeco-button--muted artdeco-button--icon-right artdeco-button--1 artdeco-button--tertiary ember-view" type="button">  <li-icon aria-hidden="true" type="chevron-right-icon" class="artdeco-button__icon" size="small"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 16 16" class="mercado-match" data-supported-dps="16x16" fill="currentColor" width="16" height="16" focusable="false">
  <path d="M5 15l4.61-7L5 1h2.39L12 8l-4.61 7z"></path>
</svg></li-icon>

<span class="artdeco-button__text">
    Next
</span></button>

We would extract the following attribute ‘aria-label="Next"’ to make this selector. Combining both element and attribute selectors, we can come up with one of the following:

button[aria-label="Next"]

.artdeco-pagination__button[aria-label="Next"]

Either one of these should work to find the button on every page.

# If the bot does nothing when entering text into an input form

You've created a bot that enters data into a webform. The bot opens the correct page and selects the correct form input, but no text is input.

Let’s see how the enter text step is configured. You should see a green tick in the File field, and some example text in the Text field. This looks like it should work:

[Image: entering text into inputs with browser automation]

However, when you run the automation nothing happens - no text is entered.

As always, the simple solution is to try re-selecting the input field. But if that doesn't work, you can try using a custom selector.

For example, if the HTML of your text field looks like this:

<textarea data-testid="project-desc" placeholder="Your sentence goes here" class="flex flex-1 py-2.5 px-3.5 border-purple-100 rounded shadow-sm resize-none w-full placeholder-purple-200 focus:outline-none focus:ring-0 focus:border-turquoise-700 text-gray-800" type="text" autocomplete="off" id="what-sentence-would-you-like-to-rewrite" style="min-height: 128px; max-height: unset; height: 126px;">For example if you wanted to scrape Zillow listings and not the sponsored content you could use this selector:</textarea>

Then the following attribute selector looks like a good solution:

textarea[data-testid="project-desc"]

# If your scrape works inconsistently, or does not get all the results you want

You've made a bot to scrape multiple real estate pages on a website such as Zillow. It seems simple enough: select the content and write it to a Google Sheet. For 90% of use cases, it is in fact that simple!

However, property websites have varying content and layouts. Not all content is present on every page. This type of problem can break any web scraper, scrambling the results returned.

[Image: selecting data to scrape]

To solve this problem, call on custom selectors to give you a helping hand!

For example, to scrape Zillow listings and not any sponsored content you could use this selector:

.result-list-container .list-card-info a

# Conclusion

With the right methods, web scraping is a simple way to collect important data. If you're comfortable using custom selectors, there will be no web page that you can't extract content from. You'll become an expert in no time. Attribute selectors in particular are often unique and can be useful if you want to create custom selectors quickly.

Good luck!

UPDATE!!!

In the next couple of releases we will be introducing a no-code tool that provides hints and helps you set custom selectors. Watch this space!
