split-pcap.pl - A Perl script to separate the TCP flows in a pcap file

Packet trace files can contain packets from many distinct TCP flows. Sometimes you want to look at specific flows or compare two flows. If you know which flows you want, extracting them from the file is easy, but extracting every flow into its own trace can be very time consuming. The command line in figure 1 has Tshark do it, but it scans the trace file N+1 times, where N is the number of TCP streams identified. For a large file with many TCP flows this can take anywhere from many minutes to hours.

# date; f=test-short.pcap; for x in $(tshark -r $f -T fields -e tcp.stream | sort -nu); do tshark \
-r $f -R "tcp.stream == $x" -w /tmp/Stream-$x--$f; done; date
Sun Feb  8 11:20:42 MST 2015
. . . .
Sun Feb  8 13:36:48 MST 2015
#

# ls -l Stream*test-short.pcap | wc -l
958
#

Figure 1 - Tshark command to separate streams into individual files

What I wanted was something that would extract all the TCP streams in a single pass through the file. I could not find anything, so I wrote the following Perl script. See figure 2 for a comparison of execution times.

Usage

perl split-pcap.pl TYPE PCAP-FILE

Where:
TYPE is one of three strings: "ether" for frames containing an Ethernet header followed by an IP header, "evlan" for frames containing an Ethernet header followed by a VLAN header followed by an IP header, or "sll" for frames containing a "Linux cooked" header (SLL stands for sockaddr_ll, or socket link layer). If the script runs but does not split out any traces, chances are you used the wrong type or the trace contains a frame type other than these three.

PCAP-FILE is the name of the pcap file. This can be a relative or absolute name. The file must be in pcap format, not pcapng format.
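
To illustrate why the TYPE argument matters, here is a minimal sketch of where the two-byte protocol type field sits for each of the three frame types. The offsets are the ones the script uses; the helper name proto_type and the hand-built sample frames are made up for this example.

```perl
use strict;
use warnings;

# Offset of the two-byte protocol type field for each TYPE.
my %type_offset = (
    ether => 12,   # dst MAC (6) + src MAC (6)
    sll   => 14,   # Linux cooked header: the last 2 of its 16 bytes
    evlan => 16,   # dst MAC (6) + src MAC (6) + one 802.1Q tag (4)
);

# Hypothetical helper: return the protocol type of a raw frame,
# or undef if TYPE is unknown or the frame is too short.
sub proto_type {
    my ($type, $frame) = @_;
    my $offset = $type_offset{$type};
    return undef unless defined $offset && length($frame) >= $offset + 2;
    return unpack('n', substr($frame, $offset, 2));   # 16-bit big-endian
}

# Hand-built frames: zeroed addresses followed by the type field 0x0800 (IP).
my $ether = ("\0" x 12) . pack('n', 0x0800);
my $evlan = ("\0" x 12) . pack('nn', 0x8100, 1) . pack('n', 0x0800);

printf "ether read as ether: 0x%04x\n", proto_type('ether', $ether);
printf "evlan read as evlan: 0x%04x\n", proto_type('evlan', $evlan);
printf "evlan read as ether: 0x%04x\n", proto_type('ether', $evlan);
```

Reading a VLAN-tagged frame with the "ether" offset yields 0x8100 (the 802.1Q tag) rather than 0x0800, so the IP test fails and no flows are written. That is the "wrong type" symptom described above.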

Output consists primarily of a set of files named PCAP-FILE-IPx:TCP_PORTx-IPy:TCP_PORTy.pcap. The IPx:TCP_PORTx-IPy:TCP_PORTy pair is ordered by TCP port value so that TCP_PORTx < TCP_PORTy. Files are written in the same directory as PCAP-FILE. NOTE that this is the TCP 4-tuple, which is NOT quite the same as a Wireshark/Tshark stream. If the port numbers are reused in multiple connections, Wireshark/Tshark will recognize this and create multiple streams. This script is not so smart: all packets with the same 4-tuple are grouped together.
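
The ordering rule can be sketched as follows. The helper name four_tuple_key is made up for this example; the comparison mirrors the one in the script, and the addresses are taken from figure 2.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the script's ordering rule: the endpoint
# with the lower TCP port is written first.
sub four_tuple_key {
    my ($src_ip, $src_port, $dst_ip, $dst_port) = @_;
    return $src_port < $dst_port
        ? "$src_ip:$src_port-$dst_ip:$dst_port"
        : "$dst_ip:$dst_port-$src_ip:$src_port";
}

# Both directions of one connection map to the same key.
my $request  = four_tuple_key('192.168.1.200', 42583, '103.10.4.216', 80);
my $response = four_tuple_key('103.10.4.216', 80, '192.168.1.200', 42583);

print "$request\n";
print "both directions share one key\n" if $request eq $response;
print "test-short.pcap-$request.pcap\n";   # resulting output file name
```

One consequence of ordering on the port alone: when both ports are equal the comparison always takes the second branch, so the two directions of such a connection produce different keys and land in different files.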

There is also a status line that gives an approximate percent complete, along with the number of bytes processed and the total number of bytes. This is approximate but close enough to give you an idea of how long the process will take. The status line also shows two counts: the number of flows identified and saved, and the number identified but not separated out because of the open file count limitation. If the "identified but not separated out" number is not zero there will also be a message that the file count has been exceeded (see figure 3). In that case the PCAP-FILE-missed-4-tuples file lists the 4-tuples that were not written. The script creates this file on every run, but it is empty when nothing was missed.
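
For comparison, here is a rough sketch of an exact byte count. It assumes the classic pcap layout of a 24-byte file header plus a 16-byte record header in front of every packet. The helper names and the captured lengths are made up; in a Net::Pcap callback the captured length is available as $hdr->{caplen}.

```perl
use strict;
use warnings;

use constant PCAP_FILE_HEADER   => 24;   # global header at the start of the file
use constant PCAP_RECORD_HEADER => 16;   # per-packet header before each frame

# Hypothetical helper: bytes consumed from the file after reading packets
# with the given captured lengths.
sub bytes_consumed {
    my (@caplens) = @_;
    my $bytes = PCAP_FILE_HEADER;
    $bytes += PCAP_RECORD_HEADER + $_ for @caplens;
    return $bytes;
}

# Hypothetical helper: percent complete, guarded against an empty file.
sub percent_complete {
    my ($bytes, $size) = @_;
    return $size > 0 ? 100 * $bytes / $size : 0;
}

my @caplens = (60, 74, 1514);                     # made-up captured lengths
my $size    = bytes_consumed(@caplens);           # file holding exactly these packets
my $so_far  = bytes_consumed(@caplens[0, 1]);     # after the first two packets
printf "%2.0f%% (%d/%d)\n", percent_complete($so_far, $size), $so_far, $size;
```

Counting this way would reach the full file size on the last packet. The script itself uses a simpler estimate, which is why its final byte count in the figures stops slightly short of the total.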

Requirements

This script requires the Perl-Net-Pcap package, which provides the Net::Pcap module.

Limitations

The split into separate files is based on the 4-tuple <source-ip, TCP-source-port, destination-ip, TCP-destination-port>. If port numbers are reused, packets from all of the affected streams will be combined in one file.
As each new 4-tuple is recognized, a new file is opened to write the corresponding packets. It is possible to exceed the number of available file descriptors, after which no new files can be created. This state is flagged and a count is kept of the 4-tuples not written. In addition, a "missed" file is created listing those 4-tuples.
The 4-tuples are managed with a hash. It is possible that there are so many 4-tuples that the ability to write to the hash is exceeded. If that happens the PCAP-FILE-missed-4-tuples file will not contain all of the missing 4-tuples. You can compare the "identified but not separated out" count with the number of lines in the PCAP-FILE-missed-4-tuples file to see if this has happened (see figure 3).
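
To get a feel for the file descriptor limit before running the script, something like the following sketch can be used. It reads the per-process open file limit through the standard POSIX module. The helper name flow_file_budget is made up, and its default of 4 descriptors already in use is an assumption (stdin, stdout, stderr, and the input pcap file).

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);

# Hypothetical helper: estimate how many flow files can be open at once.
# $in_use is the number of descriptors already taken; the default of 4
# assumes stdin, stdout, stderr and the input pcap file.
sub flow_file_budget {
    my ($limit, $in_use) = @_;
    $in_use = 4 unless defined $in_use;
    my $budget = $limit - $in_use;
    return $budget > 0 ? $budget : 0;
}

my $limit = sysconf(_SC_OPEN_MAX);   # per-process open file limit, may be undef
if (defined $limit) {
    printf "open file limit %d, room for roughly %d flow files\n",
        $limit, flow_file_budget($limit);
} else {
    print "open file limit could not be determined\n";
}
```

With a limit of 1024 this estimate gives 1020, which happens to match the saved count at which figure 3 reports "Too many open files". Raising the limit in the shell before running the script (for example with ulimit -n, where the system allows it) is one way to postpone the problem.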

The open file limitation has been eliminated in a newer Python script; see https://github.com/noahdavids/packet-analysis/blob/master/split-pcap.py

Examples

Figure 2 shows the processing of the same file as figure 1. Note the time difference, 4 seconds versus 2+ hours.

# date; perl ../y.pl sll test-short.pcap; date                                                     
Sun Feb  8 17:24:07 MST 2015
                    0% (76/76413360) 4-tuple Saved/No Saved count is 0/0
                    0% (144/76413360) 4-tuple Saved/No Saved count is 0/0
                    0% (212/76413360) 4-tuple Saved/No Saved count is 0/0
                    0% (294/76413360) 4-tuple Saved/No Saved count is 1/0
. . . . .
                   100% (76181920/76413360) 4-tuple Saved/No Saved count is 959/0
                   100% (76183131/76413360) 4-tuple Saved/No Saved count is 959/0
                   100% (76183254/76413360) 4-tuple Saved/No Saved count is 959/0
                   100% (76183336/76413360) 4-tuple Saved/No Saved count is 959/0
Sun Feb  8 17:24:11 MST 2015
#

# ls | head -5
test-short.pcap-103.10.4.216:80-192.168.1.200:42583.pcap
test-short.pcap-103.10.4.216:80-192.168.1.200:42584.pcap
test-short.pcap-103.10.4.216:80-192.168.1.200:42585.pcap
test-short.pcap-103.10.4.216:80-192.168.1.200:42586.pcap
test-short.pcap-103.10.4.216:80-192.168.1.200:42589.pcap
#

Figure 2

Figure 3 shows the processing of a larger file with so many flows that not all of them can be recorded. Note that the "identified but not separated out" counter in the last message, 997, matches the number of lines in the test.pcap-missed-4-tuples file, indicating that the hash limit was not exceeded and all 4-tuples have been identified.

perl ../split-pcap.pl sll test.pcap
                    0% (77/171839696) 4-tuple Saved/No Saved count is 0/0
                    0% (146/171839696) 4-tuple Saved/No Saved count is 0/0
                    0% (215/171839696) 4-tuple Saved/No Saved count is 0/0
                    0% (298/171839696) 4-tuple Saved/No Saved count is 1/0
                    0% (381/171839696) 4-tuple Saved/No Saved count is 1/0
. . . . .
                   46% (78481363/171839696) 4-tuple Saved/No Saved count is 1019/0
                   46% (78481528/171839696) 4-tuple Saved/No Saved count is 1019/0
                   46% (78481619/171839696) 4-tuple Saved/No Saved count is 1020/0
                   46% (78481761/171839696) 4-tuple Saved/No Saved count is 1020/0
Could not open outfile file for 4 tuple 216.58.216.14:80-192.168.1.200:46932. Too many open files  
Countinuing to process.
File limit reached 46% (78481852/171839696) 4-tuple Saved/No Saved count is 1020/1
File limit reached 46% (78481943/171839696) 4-tuple Saved/No Saved count is 1020/2
File limit reached 46% (78483474/171839696) 4-tuple Saved/No Saved count is 1020/2
File limit reached 46% (78484386/171839696) 4-tuple Saved/No Saved count is 1020/2
File limit reached 46% (78484457/171839696) 4-tuple Saved/No Saved count is 1020/2
. . . . .
File limit reached 100% (171610076/171839696) 4-tuple Saved/No Saved count is 1020/997
File limit reached 100% (171610159/171839696) 4-tuple Saved/No Saved count is 1020/997
File limit reached 100% (171610242/171839696) 4-tuple Saved/No Saved count is 1020/997
File limit reached 100% (171610325/171839696) 4-tuple Saved/No Saved count is 1020/997
File limit reached 100% (171610408/171839696) 4-tuple Saved/No Saved count is 1020/997
$

# head test.pcap-missed-4-tuples
104.219.49.71:80-192.168.1.200:49725
104.219.49.71:80-192.168.1.200:49726
104.219.49.78:80-192.168.1.200:60666
104.219.49.78:80-192.168.1.200:60667
104.28.12.19:80-192.168.1.200:33296
104.40.63.98:80-192.168.1.200:40443
104.40.63.98:80-192.168.1.200:40444
107.21.114.74:80-192.168.1.200:38393
107.21.114.74:80-192.168.1.200:38394
107.21.248.242:80-192.168.1.200:44657
$
# wc -l test.pcap-missed-4-tuples
997 test.pcap-missed-4-tuples

Figure 3

Tested with

The script is simple enough that I suspect it will work on any system with a Perl interpreter and the Net::Pcap module. However, I have tested it only on Red Hat Enterprise Linux release 6 update 6.

split-pcap.pl

 
#!/usr/bin/perl
#
# split-pcap.pl
#
# version 1.0 2015-01-25
# version 1.1 2015-02-07 added sll frame type processing
# version 1.2 2016-03-03 added evlan frame type for an Ethernet frame with VLAN tags
#
# Usage:
#    perl split-pcap.pl TYPE PCAP-FILE
#
# This script reads the PCAP-FILE and writes each Ethernet (TYPE == ether),
# SLL (TYPE == sll), or Ethernet with VLAN tags (TYPE == evlan) frame containing
# an IP/TCP packet to a file named PCAP-FILE-IPx:TCP_PORTx-IPy:TCP_PORTy.pcap.
# The IPx:TCP_PORTx-IPy:TCP_PORTy pair is ordered by TCP port value so that
# TCP_PORTx < TCP_PORTy. Files are written in the same directory as PCAP-FILE. Status
# information is written as the PCAP-FILE is read but the percentage complete value
# is approximate.
#
# Known limitations
# 1. The split into separate files is based on the 4-tuple
#    <source-ip, TCP-source-port, destination-ip, TCP-destination-port>;
#    if port numbers are reused, packets from all streams will be combined
#    in the one file.
# 2. As a 4-tuple is recognized a new file is opened to write the
#    corresponding packets. It is possible that the number of file descriptors
#    can be exceeded and no new files can be created. This state is flagged
#    and a count kept of 4-tuples not written. In addition a "missed" file
#    is created listing the 4-tuples.
# 3. The 4-tuple is managed with a hash; it is possible that there are so
#    many 4-tuples that the ability to write to the hash is exceeded.
#
# Testing
# This script has been tested only on Red Hat Enterprise Linux 6x
#
use Net::Pcap;
use strict;
use warnings;

my $type;
my $length;
my $header;
my $size;
my $bytes = 0;
my %fourTuple;
my $fourTupleKey;
my $fourTupleSavedCount = 0;
my $fourTupleNotSavedCount = 0;
my $okToOpen = 1;
my $dumpFile;

# Exactly two arguments are required: TYPE and PCAP-FILE.
if (@ARGV != 2) {
    print "Usage: perl split-pcap.pl [ether | sll | evlan] PCAP-FILE\n";
    exit 1;
}

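# Check that the file is readable and record its size for the status line.
open(my $inFile, '<', $ARGV[1]) or die "Can't open $ARGV[1]. $!\n";
$length = read($inFile, $header, 8);
die "Can't read from $ARGV[1]. $!\n" unless defined $length;
die "Can't read from " . $ARGV[1] . ", $length < 8\n" if $length < 8;
$size = -s $ARGV[1];
close($inFile);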

$type = $ARGV[0];

my ($pcap, $err);
# open_offline reports failures through $err rather than $!
$pcap = Net::Pcap::open_offline ($ARGV[1], \$err) || die "Could not open file " . $ARGV[1] . ". $err\n";

Net::Pcap::loop ($pcap, -1, \&processPacket, 0);


foreach (sort keys %fourTuple) {
    if ($fourTuple{$_} >0) {
     Net::Pcap::pcap_dump_close ($fourTuple{$_});
    }
  }

open (OUTFILE, ">", $ARGV[1] . "-missed-4-tuples") or die $!;
foreach (sort keys %fourTuple) {
    if ($fourTuple{$_} == -1) {
     print OUTFILE "$_\n";
    }
  }
close (OUTFILE);

sub processPacket {
    my($user_data, $hdr, $pkt) = @_;

    my($etherTypeLoc) = 12;     # start of the Protocol Type for Ethernet frame
    my($sllTypeLoc) = 14;       # start of the Protocol Type for SLL (Linux cooked) frame
    my($etherVlanTypeLoc) = 16; # start of the Protocol Type for an Ethernet frame with one VLAN tag
    my($typeOffset);
    my($protoType);
    my($ipHeaderLen);
    my($ipProto);
    my($ipSrcAddr);
    my($ipDstAddr);
    my($tcpSrcPort);
    my($tcpDstPort);
    my($dumpFileName);

    if ($type eq "ether") {
       $typeOffset = $etherTypeLoc;
    } elsif ($type eq "sll") {
       $typeOffset = $sllTypeLoc;
    } else {   # any other TYPE is treated as evlan
       $typeOffset = $etherVlanTypeLoc;
    }

    $protoType = ord (substr($pkt, $typeOffset, 1)) * 256 +
            ord (substr($pkt, $typeOffset+1, 1));

    if ($protoType == 0x0800) {   # Frame type is IP
       $ipHeaderLen = (ord (substr($pkt, $typeOffset+2, 1)) & 0x0F) * 4;
       $ipProto = ord ( substr($pkt, $typeOffset+11, 1));
       if ($ipProto == 6) {   # frame type is TCP
          $ipSrcAddr = sprintf("%d.%d.%d.%d",
             ord( substr($pkt, $typeOffset+14, 1) ),
             ord( substr($pkt, $typeOffset+15, 1) ),
             ord( substr($pkt, $typeOffset+16, 1) ),
             ord( substr($pkt, $typeOffset+17, 1) ));
          $ipDstAddr = sprintf("%d.%d.%d.%d",
             ord( substr($pkt, $typeOffset+18, 1) ),
             ord( substr($pkt, $typeOffset+19, 1) ),
             ord( substr($pkt, $typeOffset+20, 1) ),
             ord( substr($pkt, $typeOffset+21, 1) ));
          $tcpSrcPort = ord (substr($pkt,
                             $typeOffset+2+$ipHeaderLen, 1)) * 256 +
                        ord (substr($pkt, $typeOffset+2+$ipHeaderLen+1, 1));
          $tcpDstPort = ord (substr($pkt,
                             $typeOffset+2+$ipHeaderLen+2, 1)) * 256 +
                        ord (substr($pkt, $typeOffset+2+$ipHeaderLen+3, 1));

          if ($tcpSrcPort < $tcpDstPort) {
             $fourTupleKey = $ipSrcAddr . ":" . $tcpSrcPort . "-" .
                             $ipDstAddr . ":" . $tcpDstPort;
          } else {
             $fourTupleKey = $ipDstAddr . ":" . $tcpDstPort . "-" .
                             $ipSrcAddr . ":" . $tcpSrcPort;
          }

          if ($fourTuple{$fourTupleKey}) {   # file for 4 tuple already open ?
             $dumpFile = $fourTuple{$fourTupleKey};
             if ($dumpFile > 0) {
                Net::Pcap::pcap_dump ($dumpFile, $hdr, $pkt);
             }
          } else {   # else file for 4 tuple already open ?
            if ($okToOpen) { # we haven't had an open error yet
               $dumpFileName = $ARGV[1] . "-" . $fourTupleKey . ".pcap";
               $dumpFile = Net::Pcap::pcap_dump_open ($pcap, $dumpFileName);
               if ($dumpFile) { # did we open a new dump file
                  $fourTuple{$fourTupleKey} = $dumpFile;
                  Net::Pcap::pcap_dump ($fourTuple{$fourTupleKey}, $hdr, $pkt);
                  $fourTupleSavedCount++;
               } else {   # else we could not open a new dump file
                 print "Could not open outfile file for 4 tuple " . $fourTupleKey .
                       ". $!\n";
                 print "Countinuing to process.\n";
                 $okToOpen = 0;
                 $fourTupleNotSavedCount = 1;
                 $fourTuple{$fourTupleKey} = -1;
               }   # end else we could not open a new dump file
            } else {   # else we haven't had an open error yet
                   $fourTupleNotSavedCount++;
                   $fourTuple{$fourTupleKey} = -1;
            }   # end else we haven't had an open error yet
          }   # end else file for 4 tuple already open ?
       }   # end Frame type is TCP
    }   # end Frame type is IP

    # $hdr is a hash reference, so length($hdr) is the length of its string
    # form rather than the size of the pcap record header. Together with the
    # uncounted file header this is why the percentage is only approximate.
    $bytes += (length ($hdr) + length ($pkt));
    printf("%s %2.0f%% (%d/%d) 4-tuple Saved/No Saved count is %d/%d\n",
       ($okToOpen ? "                  " : "File limit reached"),
       (100*$bytes/$size), $bytes, $size,
       $fourTupleSavedCount, $fourTupleNotSavedCount);
}

This page was last modified on 17-07-09

Send comments and suggestions
to noah@noahdavids.org