DOM Parsing & Finding Links/Images with Python

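So you’re looking to find links or images in some HTML. A lot of people’s first instinct is to use a regular expression, but as it turns out, HTML (like most markup languages) is too flexible for that to be a reliable method of parsing. Attributes can appear in any order, quoting varies, and tags can span lines, so a pattern that works on one page will quietly miss things on the next. It should be avoided at all costs.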

The right way to do it is by using a DOM parser. A DOM parser is a library that reads the HTML and builds a tree-like structure of its elements, which makes the document fairly easy to search and manipulate in a reliable manner. There’s a very good DOM parsing library for Python called Beautiful Soup.

So, for the purpose of this example, let’s say we have a variable called html that contains the page body that we want to parse.
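Here is a minimal sketch of that setup. It assumes Beautiful Soup 4 is installed (the bs4 package), and the sample markup in html is made up for illustration; in practice it would be whatever page body you fetched.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page body.
html = """
<html><body>
  <a href="https://example.com/">Example</a>
  <a href="/about">About</a>
  <img src="/static/logo.png" alt="Logo">
</body></html>
"""

# "html.parser" is the parser that ships with Python; other parsers
# such as lxml can be used instead if they are installed.
soup = BeautifulSoup(html, "html.parser")
```

Naming the parser explicitly keeps the result consistent from one machine to the next.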

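Passing html to the BeautifulSoup constructor gives you a DOM object, here called soup, which you can manipulate. If you want to find all of the links on the page, the easiest way is to call the find_all() method (spelled findAll() in older versions of the library) and pass it the tag name "a" to search for.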
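As a rough sketch, assuming Beautiful Soup 4 and a made-up snippet of markup, the search looks something like this. The variable name links is just my choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; in practice html holds your page body.
html = '<p><a href="https://example.com/">Example</a> and <a href="/about">About</a></p>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag in document order.
links = soup.find_all("a")
print(links)
```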

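Now this will give you a list of node objects. If you just want the link URLs as strings, you can use a list comprehension to pull out each node's href attribute. Attributes are read like dictionary keys, so that is node["href"], or node.get("href") if some anchors might not have one.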
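One way to write that comprehension is sketched below, again assuming Beautiful Soup 4 and invented sample markup. Passing href=True to find_all() is my addition here: it skips anchors with no href, so the dictionary-style lookup cannot raise a KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, including one anchor without an href.
html = (
    '<a href="https://example.com/">Example</a>'
    '<a name="top">No href here</a>'
    '<a href="/about">About</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```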

Now you have all of your links as a list of strings to do with as you please!

We can repeat the process to find all of the images pretty simply. Search for "img" tags instead and read each one's src attribute.
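A sketch of the image version, under the same assumptions as before (Beautiful Soup 4, made-up markup, and image_urls as an illustrative name):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing two images.
html = (
    '<div><img src="/static/logo.png" alt="Logo">'
    '<img src="https://example.com/photo.jpg"></div>'
)
soup = BeautifulSoup(html, "html.parser")

# src=True skips any <img> tag that has no src attribute.
image_urls = [img["src"] for img in soup.find_all("img", src=True)]
print(image_urls)
```

Relative paths such as /static/logo.png come back exactly as written in the markup, so you may want to resolve them against the page URL (for example with urllib.parse.urljoin) before fetching anything.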

This is obviously a very simple use case. I highly recommend skimming the Beautiful Soup documentation to see what else you can do with it, as it’s an incredibly powerful library.