Packet Capture in Python with PyPCAP and DPKT

I recently found myself doing some packet analysis in Python using pypcap and dpkt. These are very powerful tools, but they’re woefully undocumented, so I thought I’d put up a simple example of some packet capture with them.

Setup

First off, they require a bit of setup. You’ll need to install libpcap. On Ubuntu it’s as simple as
$ sudo apt-get install libpcap-dev
After that’s done, you can simply use pip or easy_install to get pypcap and dpkt:

$ sudo pip install pypcap
$ sudo pip install dpkt

Running

First you want to create a pcap object and set up any filters you need for the capture. The pcap constructor takes a few parameters:

  • name: the name of the interface to listen on, such as 'eth0' or 'en0'
  • timeout_ms: The number of milliseconds to wait for traffic before timing out
  • immediate: Defaults to False. Setting this to True removes buffering and returns packets to you as they come in instead of in batches.
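Putting those together, here’s a minimal sketch; 'eth0' and the BPF filter string are just placeholders for whatever you actually want to capture:

import pcap

# Open the interface with a 60-second timeout and immediate packet delivery
pc = pcap.pcap(name='eth0', timeout_ms=60000, immediate=True)
# Restrict the capture with a standard BPF filter expression
pc.setfilter('tcp port 80')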

You then write a callback function to handle each packet.
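Here’s a sketch of such a callback, assuming dpkt is doing the decoding; pypcap hands the callback a timestamp and the raw packet bytes:

import struct
import dpkt

def handle_packet(timestamp, pkt):
    # Decode the raw bytes into an Ethernet frame, then grab its payload
    eth = dpkt.ethernet.Ethernet(pkt)
    ip = eth.data
    if not isinstance(ip, dpkt.ip.IP):
        return  # not an IP packet (e.g. ARP), ignore it
    # ip.src is a packed 4-byte address; unpack it to print as an integer
    print(struct.unpack('!I', ip.src)[0])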

In this example, if you’re looking for TCP or UDP information, it will be contained in ip.data; you can check the protocol type to determine which one. This function just prints the source IP addresses as integers.

To start collecting packets you call the loop function. loop takes two parameters:

  • The number of packets to collect. 0 or -1 means continuous
  • The callback function to fire for each packet
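With the callback from above, kicking off the capture looks something like this:

# Capture continuously (0 packets = no limit), calling handle_packet for each one
pc.loop(0, handle_packet)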

It’s important to note that if no traffic matches your filter within the timeout you set in the pcap constructor, the function will return with no results. If you want to truly capture indefinitely, I recommend setting a high timeout like 60,000 ms and putting your loop call inside a while loop with any necessary break clauses. Note that there’s a small window to miss traffic between hitting the timeout and re-calling loop(), due to a small initialization time, which is a good reason for the long timeout.
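A sketch of that pattern; should_stop() here stands in for whatever exit condition makes sense for you:

while True:
    # loop() returns once timeout_ms passes with no matching traffic
    pc.loop(0, handle_packet)
    if should_stop():  # hypothetical exit condition you supply
        break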

DOM Parsing & Finding Links/Images with Python

So you’re looking to find links or images in some HTML. A lot of people’s first instinct is to use regex, but as it turns out, HTML (like most markup languages) is too flexible for that to be a reliable method of parsing, and it should be avoided at all costs.

The right way to do it is by using a DOM parser. A DOM parser is a library that will traverse the HTML and make a tree-like structure of the elements throughout it. This makes it fairly easy to search and manipulate in a reliable manner. There’s a very good DOM parsing library for Python called Beautiful Soup.

So, for the purposes of this example, let’s say we have a variable called html that contains the page body we want to parse.
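Building the tree is a one-liner. This sketch assumes BeautifulSoup 4 (pip install beautifulsoup4) and Python’s built-in html.parser:

from bs4 import BeautifulSoup

# Parse the raw HTML string into a searchable tree of nodes
soup = BeautifulSoup(html, 'html.parser')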

Now you’ve got a DOM object called soup which you can manipulate. If you want to find all of the links on the page, the easiest way is to call the findAll() function and pass it an "a" tag to search for.
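For example:

# Find every anchor tag in the document
links = soup.findAll('a')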

Now this will give you a list of node objects. If you just want the URLs as strings, you can do a list comprehension to pull out the href attributes.
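A sketch, using .get() so anchors without an href just yield None:

# Pull the href attribute out of each anchor node
hrefs = [link.get('href') for link in links]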

Now you have all of your links as a list of strings to do with as you please!

We can repeat the process to find all of the images pretty simply.
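A sketch along the same lines, grabbing the src attribute from each img tag:

# Same idea: collect every img tag, then pull out its src attribute
images = soup.findAll('img')
srcs = [img.get('src') for img in images]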

This is obviously a very simple use case. I highly recommend skimming the BeautifulSoup documentation to see what else you can do with it as it’s an incredibly powerful library.