Analyzing raw TCP streams

An obvious way to capture HTTP requests and responses is to dump them with a network sniffer. This only works for cleartext connections (without TLS encryption), but on the other hand, you don’t need to change your clients or servers.

HTTPolice can parse HTTP/1.x streams from the ground up. Parsing HTTP/2 is not yet supported.

Using tcpflow

You may be familiar with tcpdump, but its output won’t work here: HTTPolice needs the reassembled TCP streams, not individual packets. You can get these streams with a tool called tcpflow:

$ mkdir dump

$ cd dump/

$ sudo tcpflow -T'%t-%A-%a-%B-%b-%#' port 80
reportfilename: ./report.xml
tcpflow: listening on wlp4s0

(Note the -T option: it sets the names of the output files, and HTTPolice relies on those names to pair the streams.)

tcpflow starts capturing all connections to or from TCP port 80. For example, you can launch a Web browser and go to an ‘http:’ site. Once you are done, exit the browser, then stop tcpflow with Ctrl+C. (It is important that connections are closed before tcpflow shuts down, otherwise they may be incomplete.)

Now you have one or more pairs of stream files:

$ ls
1469847441-  report.xml

Tell HTTPolice to read this directory with the tcpflow input format:

$ httpolice -i tcpflow .

HTTPolice will combine the files into pairs based on their filenames. Due to a limitation in tcpflow, this only works if every combination of source address and port with destination address and port is unique within the directory. If there are duplicates, you will get an error.
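To see in advance whether a capture contains such duplicates, you can inspect the filenames yourself. The following is a minimal sketch, not part of HTTPolice: the function name find_duplicate_flows is made up for illustration, and it assumes the files were named with the -T template shown above and that the address fields contain no hyphens. It is not guaranteed to match HTTPolice's own check exactly.

```shell
# Print every "srcaddr-srcport-dstaddr-dstport" combination that occurs
# more than once among the tcpflow output files in a directory.
# Assumption: files are named with -T'%t-%A-%a-%B-%b-%#',
# so fields 2 to 5 (split on '-') identify the flow.
find_duplicate_flows() {
  dir=${1:-.}
  for f in "$dir"/*-*-*-*-*-*; do
    # Skip the literal pattern if nothing matched.
    [ -e "$f" ] || continue
    basename "$f"
  done | cut -d- -f2-5 | sort | uniq -d
}
```

Running find_duplicate_flows . in the dump directory prints nothing if every combination is unique. Files without at least five hyphens in their names, such as report.xml, are ignored.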

It’s OK if you capture some streams that are not HTTP/1.x. HTTPolice will just complain with notices such as 1279. This means you can run tcpflow without a filter, capturing all TCP traffic on a given network interface, and then let HTTPolice sort it out while silencing those notices:

$ sudo tcpflow -T'%t-%A-%a-%B-%b-%#'

$ httpolice -i tcpflow -o html -s 1279 . >../report.html

Using tcpick

tcpick is another tool for reassembling TCP streams. It doesn’t have the “unique port” limitation of tcpflow, but it has a different problem: sometimes it produces files that are clearly invalid HTTP streams (HTTPolice will fail to parse them with notices like 1009).

Anyway, using it is very similar to using tcpflow:

$ mkdir dump

$ cd dump/

$ sudo tcpick -wR -F2 'port 80'
Starting tcpick 0.2.1 at 2016-07-30 06:14 MSK
Timeout for connections is 600
tcpick: listening on wlp4s0
setting filter: "port 80"
3837 packets captured
30 tcp sessions detected

$ httpolice -i tcpick .

(Note the -wR -F2 options.)

Other sniffers

If you use some other tool to capture the TCP streams, use the streams input format to pass pairs of files:

$ httpolice -i streams requests1.dat responses1.dat requests2.dat ...

Or req-stream if you only have request streams:

$ httpolice -i req-stream requests1.dat requests2.dat ...

Or resp-stream if you only have response streams (not recommended):

$ httpolice -i resp-stream responses1.dat responses2.dat ...

Note that resp-stream may not work at all if any of the requests are HEAD, because responses to HEAD are parsed differently.
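With many connections, typing out the pairs for the streams format by hand gets tedious. Here is a rough sketch of a helper that builds the alternating argument list. It assumes a purely hypothetical naming scheme in which each connection is saved as NAME.req and NAME.resp. Your sniffer will likely name its files differently, and run_on_stream_pairs is not something HTTPolice provides.

```shell
# Run a command with request/response files from a directory appended
# as alternating arguments: req1 resp1 req2 resp2 ...
# Assumption (hypothetical layout): each connection is stored as
# NAME.req (request stream) and NAME.resp (response stream).
run_on_stream_pairs() {
  dir=$1
  shift
  for req in "$dir"/*.req; do
    resp=${req%.req}.resp
    # Skip an unmatched glob and any request file without a response file.
    [ -e "$req" ] && [ -e "$resp" ] || continue
    set -- "$@" "$req" "$resp"
  done
  "$@"
}
```

For example, run_on_stream_pairs dump httpolice -i streams would pass every complete pair found in dump/ to HTTPolice. If no pairs are found, the command is run without any file arguments.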

Combined format

Sometimes you want to compose an HTTP exchange by hand, to test something. To make this easier, there’s a special input format that combines the request and response streams into one file:

The lines at the beginning are ignored.
You can use them for comments.

======== BEGIN INBOUND STREAM ========
GET / HTTP/1.1
User-Agent: demo

======== BEGIN OUTBOUND STREAM ========
HTTP/1.1 200 OK
Date: Thu, 31 Dec 2015 18:26:56 GMT
Content-Type: text/plain
Connection: close

Hello world!

It must be saved with CRLF (Windows) line endings.
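If your editor makes it awkward to control line endings, one option is to generate the file from the shell. This sketch reproduces the exchange above with printf, which appends an explicit CRLF to each line. The filename exchange1.txt and the wording of the comment line are only examples.

```shell
# Write a combined-format file in which every line ends with CRLF.
# printf reuses the format '%s\r\n' for each argument, so each
# argument becomes one line; '' produces an empty line.
printf '%s\r\n' \
  'Hand-written exchange; lines before the first marker are ignored.' \
  '' \
  '======== BEGIN INBOUND STREAM ========' \
  'GET / HTTP/1.1' \
  'User-Agent: demo' \
  '' \
  '======== BEGIN OUTBOUND STREAM ========' \
  'HTTP/1.1 200 OK' \
  'Date: Thu, 31 Dec 2015 18:26:56 GMT' \
  'Content-Type: text/plain' \
  'Connection: close' \
  '' \
  'Hello world!' \
  > exchange1.txt
```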

Also, for this format, the filename suffix (extension) is important. If it is .https, the request URI is assumed to have an https: scheme. If it is .noscheme, the scheme is unknown. Otherwise, the http: scheme is assumed.

Now, tell HTTPolice to use the combined format:

$ httpolice -i combined exchange1.txt

More examples can be found in HTTPolice’s test suite.