Tag Archives: logging

PFSense, Suricata, and Splunk: mildly complicated, but very doable

I run a home lab: a bunch of VMs running vaguely security-related tools, with a PFSense router in front of everything. On PFSense, I am running Suricata on several interfaces. I also collect the data into a Splunk instance (❤️ developer license).

The Problem

You can forward data into Splunk with syslog. Ingesting syslog into Splunk is not the easiest way to collect data – ideally you want to use a Splunk Universal Forwarder (n.b. do not configure Splunk indexers or forwarders to listen for syslog directly! Use a dedicated syslog server!) – but PFSense can forward its syslog natively, and Suricata alerts get written to syslog, so why not use that?

Well, because the data that Suricata puts in a syslog event is next to useless.

02/28/2023-20:18:38.695473  [**] [1:2016683:3] ET WEB_SERVER WebShell Generic - wget http - POST [**] [Classification: Potentially Bad Traffic] [Priority: 2] {TCP} 118.232.97.242:56937 -> <REDACTED>:80

A signature name, a source, and a destination. That’s it. Nothing that really lets you understand what was happening underneath. What you really want is the Suricata eve.json output, which PFSense very helpfully allows you to enable. This contains a wealth of data: the raw packet (in base64), decoded protocol elements such as HTTP request headers, the URL, a DNS query, and more. But although you can view these events in the PFSense web UI, they are not covered by the native log management.
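For comparison, here is roughly what the same alert looks like as an eve.json event. This is a trimmed illustration based on the documented eve.json format rather than a real capture, with most values abridged:

{
  "timestamp": "2023-02-28T20:18:38.695473+0000",
  "event_type": "alert",
  "src_ip": "118.232.97.242",
  "src_port": 56937,
  "dest_ip": "<REDACTED>",
  "dest_port": 80,
  "proto": "TCP",
  "alert": {
    "signature_id": 2016683,
    "rev": 3,
    "signature": "ET WEB_SERVER WebShell Generic - wget http - POST",
    "category": "Potentially Bad Traffic",
    "severity": 2
  },
  "http": {
    "hostname": "<REDACTED>",
    "url": "...",
    "http_user_agent": "...",
    "http_method": "POST",
    "protocol": "HTTP/1.1"
  },
  "packet": "..."
}

So what else could we use?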

Splunk is easiest when you use a Splunk forwarder. Always use one when you can! However, sometimes it’s not the right choice. Technically you can run the forwarder on PFSense, as it’s FreeBSD, and there is a FreeBSD build of the forwarder. However, it has a significant drawback: although you appear to be able to enable boot-start, it doesn’t actually work. PFSense manages the set of services that run at boot, and anything that is not an official PFSense package won’t get started. I’d been running it like this for ages, and given that I only rebooted my firewall once every six months or so, it was only a minor headache. If you’re running it as a production box though, that’s very much not ideal!

So, syslog is out, and a Splunk forwarder is problematic. What next?

Enter syslog-ng

PFSense is extensible, with a number of officially maintained add-on packages. Suricata is one. Another is the versatile syslog management engine, syslog-ng. Almost any way you can imagine of moving a log from one place to another, syslog-ng can do it for you. It could certainly scoop up files that aren’t covered by PFSense’s default syslog forwarding, and send them over to a syslog receiver. But. But.

syslog-ng can also write directly to the Splunk HTTP Event Collector (HEC). In fact, this is exactly what the Splunk Connect 4 Syslog package is under the hood – syslog-ng with a bunch of wrapping around it to make configuring it for a large number of Splunk inputs a bit less work. We don’t have that handy wrapping in PFSense, but it will absolutely let us do the necessary config manually, using an officially PFSense supported method. This post is therefore a step-by-step on how to set that up.

1. Groundwork

1.1 Splunk

You will need to have Splunk HEC set up. This part is covered by Splunk training (look in particular at the System Admin, Data Admin, and Architecting courses) and documentation so I will not rehash it here. However, as a brief summary, you will need to:

  • enable HEC and generate tokens
  • configure a load balancer (this is a non-Splunk item, but it is a critical step if you have a Splunk deployment with multiple indexers; do not direct HEC output at just one indexer in a cluster, it will do bad things to your Splunk deployment)

Tokens for a single-instance deployment can be found in Settings > Data Inputs > HTTP Event Collector. A HEC token looks like this in Splunk Web:

Extract from Splunk web interface showing a token named "sample_token", the edit/disable/delete links, and the token itself which is a GUID-style string of multiple hexadecimal blocks joined by dashes

Consult the documentation linked above for information about how to obtain the token if set up in a distributed deployment.
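Once you have a token, it’s worth a quick smoke test before touching PFSense at all. Here is a minimal check using Python’s requests library – substitute your own endpoint and token, and only use verify=False if you know you are on a self-signed certificate:

import requests

HEC_URL = "https://<splunk_hec_endpoint>:8088/services/collector/event"
TOKEN = "<generated_hec_token>"

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {TOKEN}"},
    json={"event": "HEC smoke test", "sourcetype": "manual"},
    verify=False,  # only if you expect a self-signed certificate
)
print(resp.status_code, resp.text)  # expect 200 and {"text":"Success","code":0}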

1.2 PFSense

  • Go to the System > Package Manager screen, search for the syslog-ng package, and install it

1.3 Network

Your PFSense device needs to be able to connect to the address of the HEC endpoint, on the appropriate port. The default port Splunk uses for this purpose is 8088. If using a load balancer, it must be able to connect to all Splunk indexers on the relevant port, and your PFSense device must also be able to connect to the load balancer. Test both of these things before starting to configure syslog-ng.

If you are intending to use a hostname to specify the Splunk instance / load balancer address, make sure that PFSense can resolve the hostname.
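If you have Python available on a host along the same network path, a quick reachability check could look like the following sketch (the host list is an example; include the load balancer and, if applicable, each indexer):

import socket

for host in ("<splunk_hec_endpoint>", "<indexer_1>", "<indexer_2>"):
    try:
        socket.create_connection((host, 8088), timeout=5).close()
        print(host, "reachable on 8088")
    except OSError as err:
        print(host, "FAILED:", err)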

2. Configure syslog-ng

The most basic syslog-ng configuration has 3 components: a source, a destination, and a log directive that instructs syslog-ng to send source X to destination Y. The configuration needed for this use case is only a few minor tweaks away from this baseline. To begin configuring, navigate to Services > syslog-ng in the PFSense admin interface. You quite likely will not need to alter anything under the General tab. Configuring specific logging settings is done under the Advanced tab. Click Add to start writing a config.

2.1 The source

The “Object Type” for a source must be… Source. Sorry, no prizes for guessing that one.

Editing a config of syslog-ng in PFsense. There are blank fields labelled "Object Name", "Object Parameters" and "Description"; and a dropdown "Object Type" selected on "Source"

The “Object Name” is a unique identifier for the config stanza you are defining. There are few strict limitations, but it is a recommended convention to prefix source config names with “s_”, destination names with “d_” etc. The remainder should be brief, but descriptive. This config is to read the Suricata eve.json log files, so I have named it “s_suricata_eve_json”.

The “Object Parameters” define what is actually going to happen. To determine what to set, we must understand where and how the data we want to send exists.

PFSense stores Suricata logs in /var/log/suricata. It can run multiple instances of Suricata, one for each firewall interface. Every instance of Suricata gets its own directory within this path, and the logs are in these subdirectories.

Command line listing of /var/log/suricata showing multiple directories named after interfaces

We could write a separate source stanza for each individual file, manually specifying the interface name. However, that way you would need to edit the config whenever you enable Suricata on a new interface. We can instead watch all the directories at once.

Editing a config of syslog-ng in PFsense, showing an object of type "Source" with the title s_suricata_eve_json

The wildcard-file option allows collecting multiple files, and can recursively search directories from a specified base path. That’s perfect for our use case. We set the base-dir option to /var/log/suricata, set recursion to “yes”, and read all files named “eve.json”.

Additionally, the “no-parse” flag is set. This is because the default behaviour of syslog-ng is to attempt to interpret all messages as RFC-compliant syslog messages, where there is a set of default header fields such as syslog priority, timestamp, and host. Suricata eve.json events consist of a JSON object, with no header; trying to parse a syslog header from this results in improperly formatted JSON (and we need it to be valid JSON when it is sent to Splunk). This is the resulting definition:

{
  wildcard-file(
    base-dir("/var/log/suricata/")
    filename-pattern("eve.json")
    recursive(yes)
    flags(no-parse)
  );
};

Write a brief description, save this configuration, then click Add again for the next one.

2.2 The destination

You need to direct the events which are found in the source to your Splunk HEC receiver. Set the “Object Type” as “Destination”. My destination is labelled “d_splunk_suricata_hec”.

The syslog-ng option that allows sending data to HEC is the http() function. In this function we will define the destination (the HEC endpoint host plus the path “/services/collector/event”, which is where the Splunk HEC listener receives data), the token generated in step 1.1, and the HTTP body. The body is a JSON object with a specific set of fields that Splunk expects.

PFSense screenshot showing editing of a syslog-ng config set up to send to Splunk HEC

You must replace several elements of this with values specific to your environment:

  • <splunk_hec_endpoint> should be the IP or hostname of your load balancer, or of the Splunk instance if it is an all-in-one instance
  • <generated_hec_token> should be replaced with the token generated in step 1.1
  • In an ideal environment, you will be using proper certificate management with PKI; instead of setting peer-verify(no), you would load your organisation’s certificates into PFSense
  • <index> should be changed to the Splunk index you wish the logs to be sent to

After changing the values it should look something like this:

{
    http(url("https://172.16.3.9:8088/services/collector/event")
        method("POST")
        user_agent("syslog-ng User Agent")
        user("user")
        password("eb2cb049-3091-4b82-af1e-da11349a5e2b")
        peer-verify(no)
        body("{ \"time\": ${S_UNIXTIME},
                \"host\": \"${HOST}\",
                \"source\": \"${FILE_NAME}\",
                \"sourcetype\": \"suricata\",
                \"index\": \"suricata\", 
                \"event\":  ${MSG} }\n")
  );
};

Write a brief Description, save the configuration, and click Add to start writing the final part.

2.3 The log directive

Now that a source and destination have been defined, they can be connected together with a third stanza, where the “Object Type” is “Log”. This is the simplest of the three, and looks like so:

PFSense admin page showing a syslog-ng log stanza being configured

The source() function uses the Object Name chosen in step 2.1; the destination() function takes the name chosen in 2.2. Add these in, set an object name and description for this stanza, and save – and you should be rolling!

{ 
  source(s_suricata_eve_json);
  destination(d_splunk_suricata_hec); 
};

3. Checking your work

The first place to look is in the index you set as destination for Suricata events. Depending on how busy the device is, you might get dozens of events a minute, or only a few per hour. If you don’t see anything, try looking in the following places to see why:

3.1 Suricata logs on PFSense

You can see the events as they are written on the device under Services > Suricata > Logs View. If you have command line access you can also look in the filesystem at /var/log/suricata/<interface name>/eve.json.

3.2 syslog-ng logs on PFSense

Under Services > syslog-ng > Log Viewer, you can see recent messages from the syslog-ng service. Possible errors you could see here, and their causes, include:

error sending HTTP request; url='https://<your host>:8088/services/collector/event', error='Couldn\'t resolve host name' 

PFSense could not look up the specified hostname via DNS; redo step 1.3

curl: error sending HTTP request; url='https://<your host>:8088/services/collector/event', error='SSL peer certificate or SSH remote key was not OK' 

the certificate is not trusted – you should specify peer-verify(no) if this is expected

Server returned with a 4XX (client errors) status code, which means we are not authorized or the URL is not found.; url='https://<your host>:8088/services/collector/event', status_code='400' 

the request wasn’t formatted correctly; you may have made a typo when constructing the text in the body() function

3.3 HEC logs in Splunk

If syslog-ng connects successfully but submits bad information, Splunk HEC will log an error. You can search for this with:

index=_internal component=HttpInputDataHandler

If the problem is badly formatted data, the messages aren’t hugely informative, but they are at least enough to confirm roughly what’s going on.

10-11-2023 20:31:45.580 +0100 ERROR HttpInputDataHandler [23746 HttpDedicatedIoThread-0] - Failed processing http input, token name=pfsense_syslog_ng, channel=n/a, source_IP=<syslog-ng source IP>, reply=6, events_processed=0, http_input_body_size=1046, parsing_err="While expecting event object key: Unexpected character: ':', totalRequestSize=1046"

When I encountered this, the only way I could think of to see the problem was to write a second destination for syslog-ng, writing events to a new file using the same formatting text as the body() function of the http() destination. I could then read the file and figure out which bit of the JSON was incorrect.
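If you end up doing the same, a few lines of Python can pinpoint the first malformed event in that debug file (the path here is hypothetical, and a streaming decoder loop is needed because the body() template spans several lines per event):

import json

text = open("/var/log/syslog-ng-hec-debug.log").read()
decoder = json.JSONDecoder()
pos = 0
while pos < len(text):
    while pos < len(text) and text[pos].isspace():
        pos += 1  # skip whitespace between events
    if pos >= len(text):
        break
    try:
        _, pos = decoder.raw_decode(text, pos)
    except json.JSONDecodeError as err:
        print("First bad event:", err)
        print(text[pos:pos + 200])
        break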

Hopefully now that I’ve shown exactly which bits to alter in this guide, you won’t have a need for that level of debugging! If you find messages like this, first re-read section 2.2 and check your destination stanza very carefully against the example, for missing or extra characters.

4. Wrap up

If all went well, you now have all the eve.json events in Splunk, in all their lovely detail. If this has been helpful, I’d love to hear from you – or if there’s anything wrong or missing, please let me know. Happy Splunking!

Collecting Netscaler web logs

A little while ago I wrote about collecting AppFlow output from a Citrix Netscaler and turning it into Apache-style access logs. Whilst that might technically work, there are a few drawbacks – first and foremost that Logstash gobbles CPU cycles like nobody’s business.

Furthermore, since the Netscaler outputs separate AppFlow records for request and response, if you want a normal reverse proxy log, you need to put them back together yourself. Although I have already described how to achieve that, as you can see above it is also not terribly efficient. So, is there a better way? There certainly is!

NetScaler Web Log Client

In order to deliver responses to requests correctly, the Netscaler must track the state of connections internally. Instead of creating our own Frankenstein’s Monster of a state machine to reassemble request and response from AppFlow, it would be much simpler if we could get everything from a place that already has the combined state. The good news is that Citrix have provided a client application to do just that. The bad news is that their documentation is a little on the shonky side, and it isn’t always clear what they mean. To fill in some of the gaps, I have written a brief guide to getting it running on CentOS 7. I will assume for this that you have installed CentOS 7 Minimal and updated it through yum.

Obtain the client

Citrix’s description of where to find the client on their site isn’t terribly helpful. Here’s how to get there at the time of writing:

    Citrix Portal > Downloads > Citrix Netscaler ADC > Firmware > [your version] > Weblog Clients

Prep the Netscaler

Ensure Web logging is turned on

    System > Settings > Configure Advanced Features > Web Logging

Ensure remote authentication is OFF for the nsroot user (not expecting many people to encounter this problem but it’s not easy to troubleshoot – the client just shows an authentication failure even if you entered the password correctly)

    System > User Administration > Users > nsroot > Enable External Authentication

Install and configure the NSWL client

Extract the .rpm from the zip downloaded from the Citrix portal and transfer it to your CentOS system. Run the following commands as root:

    $> yum install glibc.i686
    $> rpm -i nswl_linux-[citrix_version].rpm

You need to be able to connect from the system you are running the client on to your Netscaler reverse proxy on port 3011.

    $> nc -v [netscaler_ip] 3011

Add the target IP and nsroot account credentials to the config file as described in the Citrix docs (yes, some of their instructions are accurate – just not everything):

    $> /usr/local/netscaler/bin/nswl -addns -f /usr/local/netscaler/etc/log.conf

Edit the config file to set the format, log output directory, rotation settings etc.

----extract from /usr/local/netscaler/etc/log.conf----
logFormat    NCSA %h %v %l %u %p [%t] "%r" %s %j %J %{ms}T "%{referer}i" "%{user-agent}i"
logInterval			Daily
logFileSizeLimit		1024
logFilenameFormat		/var/log/netscaler/nswl-%{%y%m%d}t.log
------------------------------------------------------

Note: Citrix do not appear to provide a complete breakdown of what format strings are accepted, so I used the Apache documentation as a reference. However, not all of the variables are supported by the NSWL client, and some work in a different manner than expected. For example, %D does not output microseconds, but the %{UNIT}T style does work.

Configure a service to run the NSWL client

    $> vim /etc/systemd/system/nswl.service

[Unit]
Description=nswl

[Service]
Type=simple
User=nswl
Group=nswl
ExecStart=/usr/local/netscaler/bin/nswl -start -f /usr/local/netscaler/etc/log.conf	

[Install]
WantedBy=multi-user.target

    $> useradd -d <log directory> -s /sbin/nologin nswl
    $> chown -R nswl:nswl <log directory>
    $> systemctl daemon-reload
    $> service nswl start

SIEM configuration and log rotation

The logFormat directive shown above is similar to the standard Apache Combined format, but not identical. To parse the output, a slightly tweaked version of the regex is necessary:

^(?<src_ip>\S+) (?<site>\S+) (?:-|(?<ident>\S+)) (?:-|(?<user>\S+)) (?<dest_port>\d+) \[[^\]]*] "(?<request>[^"]+)" (?<status>\d+) (?<request_bytes>\d+) (?<response_bytes>\d+) (?<response_time>\d+) "(?:-|(?<http_referer>[^"]*))" "(?:-|(?<http_user_agent>.*))"
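If you want to sanity-check the pattern, note that Python’s re module spells named groups (?P<name>...) rather than (?<name>...). A quick test, using a line fabricated to match the logFormat above (not real data):

import re

PATTERN = re.compile(
    r'^(?P<src_ip>\S+) (?P<site>\S+) (?:-|(?P<ident>\S+)) (?:-|(?P<user>\S+)) '
    r'(?P<dest_port>\d+) \[[^\]]*] "(?P<request>[^"]+)" (?P<status>\d+) '
    r'(?P<request_bytes>\d+) (?P<response_bytes>\d+) (?P<response_time>\d+) '
    r'"(?:-|(?P<http_referer>[^"]*))" "(?:-|(?P<http_user_agent>.*))"'
)

line = ('203.0.113.5 www.example.com - - 443 [01/Oct/2019:12:00:00 +0100] '
        '"GET /index.html HTTP/1.1" 200 321 5120 37 "-" "Mozilla/5.0"')
match = PATTERN.match(line)
print(match.groupdict() if match else "no match")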

You should use a prefix pattern to match files to collect – do NOT use a suffix pattern like ‘*.<extension>‘ to track files. The NSWL client creates a new file with ‘.<number>‘ appended under many circumstances, including when the service is restarted, when the logFileSizeLimit is reached, and others. For example, if the service was restarted while writing to ‘nswl-20191001.log‘, it would begin writing ‘nswl-20191001.log.0‘.

Make sure to take this into account when configuring log rotation – e.g. move the files before compressing: ‘$> gzip nswl-20191001.log‘ results in ‘nswl-20191001.log.gz‘, which matches the pattern ‘nswl-*‘; SIEM agents may consider the latter file to be new and index it again, resulting in duplicate data.

Results

Using 1% CPU and a single process as opposed to the previous method of attempting to melt a CPU into the motherboard substrate is a definite improvement. Another plus is that it’s an officially supported tool, so in theory if something’s not working you can actually get some help with it.

I’m pretty proud of my eldritch horror of a python script – it ran for nearly two years in production with no significant problems (unlike Logstash, which needed CPR every six weeks or so) – but it’s high time my code was retired.

Mounting image with python-guestfs

Automating a sandbox: Evidence Collection

20,000 Leagues Under The Sand: Part 5

read part 4

You may have a tricked-out sandbox that logs host activity, does packet capture and IDS, and will make you a slice of toast, but none of the bells and whistles will do you any good without collecting the information and putting it in front of your eyes. The techniques required will test your knowledge of network and file system forensics, as well as your skill with code. Let’s start with an easy one.

Suricata logs

If you have followed the suggestions made earlier in this series, Suricata will be writing events to files in /var/log/suricata/ in JSON form, one object per line. This lends itself to ease of use; pretty much any language will have a good JSON parsing library. All you will need to do is filter for entries based on the timestamp being within the period you were running your malware sample.
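A minimal sketch of that filtering in Python (the run window values are placeholders; Suricata timestamps look like 2023-02-28T20:18:38.695473+0000):

import json
from datetime import datetime

RUN_START = datetime.fromisoformat("2023-02-28T20:00:00+00:00")
RUN_END = datetime.fromisoformat("2023-02-28T20:15:00+00:00")

events = []
with open("/var/log/suricata/eve.json") as f:
    for line in f:
        event = json.loads(line)
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%S.%f%z")
        if RUN_START <= ts <= RUN_END:
            events.append(event)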

Be aware that the Suricata log does not get truncated unless you have configured rotation. If you read and filter the log using the simplest method (a line-by-line read from the start, parsing each event and then filtering), this will eventually become very slow. You should consider rotating the file, either yourself or using Suricata’s built-in rotation, and make sure that your parsing and filtering take account of this rotation.

Packet capture

As mentioned in the post discussing networking, you can either create a per-run packet capture as part of your code (assuming your language has the appropriate libraries), or a systemwide one which you can then extract portions of.

If you only ever plan to have one guest VM sandboxing malware at a time, the per-run capture should be fine and relatively simple. If you are slightly nuts ambitious like me and want to design for the possibility of several in parallel, a systemwide capture would be more suitable. Again, depending on the way you have organised capture, you should make sure your code accounts for the rotation of the pcaps.
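For the systemwide approach, one way to carve the run window out of a larger capture is Wireshark’s editcap tool; a sketch via subprocess, with example paths and times:

import subprocess

subprocess.run([
    "editcap",
    "-A", "2023-02-28 20:00:00",  # keep packets at or after the run start
    "-B", "2023-02-28 20:15:00",  # ...and at or before the run end
    "/var/pcap/capture.pcap",     # systemwide rolling capture
    "/tmp/run-1234.pcap",         # per-run extract
], check=True)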

Host activity/event logs

Early on in this series I waxed lyrical about the advantages of Sysmon. I am not going to contradict any of that here, but collecting its output is not as simple as you might think. Windows event logs get written to EVTX files, but not necessarily immediately. Therefore although an event may be generated, its presence in the EVTX file is not guaranteed. Under testing I have found that not even a shutdown is a guarantee of the events being written to the file. The only method I have found to be 100% reliable is to query the Windows Event Log API¹. Therefore, to collect Sysmon logs in a reliable fashion, you need to be able to use the Windows API.

I am aware of two methods for doing this. The first is to write a program which queries the API, and run that in your sandbox. You can then write the data to a file, or send it out of the sandbox immediately. To send it out of the sandbox you could have a service on the host listening on the virtual network interface, such as an FTP or HTTP server.
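As an illustration of the first method: you don’t necessarily have to call the Event Log API directly, because the built-in wevtutil tool wraps it. A minimal sketch, run inside the guest, querying the standard Sysmon channel:

import subprocess

# pull the newest 100 Sysmon events as XML via the Windows Event Log API
result = subprocess.run(
    ["wevtutil", "qe", "Microsoft-Windows-Sysmon/Operational",
     "/f:xml", "/c:100", "/rd:true"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)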

The second method would be to use Windows Event Forwarding. This is a tremendously useful technique for blue teamers and comes highly recommended by Microsoft staff. It does, however, require you to have a second Windows host on which to collect the events, which may not be an option for you. Most documentation you will find on this will refer to setting it up in an Active Directory environment, however it is also capable of running in workgroup-only systems.

¹ I strongly suspect that the events are being written to temporary files but at the time of writing this is little better than a hunch. I’ll chase down my suspicion at some point and if it’s right there’ll be a new post about my findings.

Filesystem collection

Getting events is a huge win, and might well be all you need; but why not go one step further? Malware drops and modifies files and writes to the registry, and if you could get your hands on that evidence, it could be invaluable. Another of the reasons for choosing LibVirt/QEMU as my hypervisor was the availability of python bindings for LibGuestFS, allowing me to directly mount and read QEMU disk images. However, you should still be fine with other hypervisors: VMWare also provides a utility for this, and VirtualBox can apparently be mounted as a… network block device? Please can I have some of whatever Oracle have been smoking, because it’s clearly the good shit.
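To give a flavour of the LibGuestFS route, here is a minimal sketch using the python bindings. The image path and guest file path are examples only:

import guestfs

g = guestfs.GuestFS(python_return_dict=True)
g.add_drive_opts("/var/lib/libvirt/images/sandbox.qcow2", readonly=1)
g.launch()

root = g.inspect_os()[0]  # first detected OS root filesystem
g.mount_ro(root, "/")
# pull out a file of interest (hypothetical path; Windows paths may need
# g.case_sensitive_path() to resolve)
g.download("/Users/victim/Desktop/dropped.bin", "./dropped.bin")
g.shutdown()
g.close()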

Detailed coverage of the options for filesystem evidence collection could run to several blog posts of its own, so I won’t go into everything here. However, I will describe three approaches, each with their own advantages and drawbacks.

  • Diffing from a known-good state

The slowest, but most comprehensive method. Requires building a comprehensive catalogue of the hashes of all files on the disk prior to malware execution, and another one after, and identifying the differences. Not recommended unless you are truly desperate to roast your CPU with hash calculations.

  • Metadata-based selection

Since you know the lower and upper time bounds for possible activity by the malicious sample, you can walk the directory tree and select only items which have been changed or created in that period (a rough sketch follows this list). Relatively quick, but some malware modifies MFT records with false created/modified values, a technique known as ‘timestomping’.

  • Key items and locations

The majority of malware activity is limited to just a few locations. Taking a copy of the user directory, and SYSTEM and SOFTWARE registry hives, plus a couple of other items, would capture the traces left by most samples you might ever run.
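As promised, a rough sketch of the metadata-based approach, assuming the guest filesystem is already mounted read-only at /mnt/guest (an example path). Note that on a Linux host st_ctime is inode change time, not NTFS creation time, so treat it as approximate:

import os
from datetime import datetime

RUN_START = datetime.fromisoformat("2023-02-28T20:00:00+00:00").timestamp()
RUN_END = datetime.fromisoformat("2023-02-28T20:15:00+00:00").timestamp()

changed = []
for dirpath, _, filenames in os.walk("/mnt/guest"):
    for name in filenames:
        path = os.path.join(dirpath, name)
        st = os.stat(path)
        # select files modified or changed within the run window
        if RUN_START <= st.st_mtime <= RUN_END or RUN_START <= st.st_ctime <= RUN_END:
            changed.append(path)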

There is a final option for collection of file-based evidence, and that is to use a host agent which collects the files as the malware writes them. The above methods would fail to capture a file that has been created and subsequently removed. In an earlier post I mentioned that if you were so inclined, you could write code which would monitor API calls yourself. Doing this would also give you the ability to capture temporary files in addition to the ones which are left behind.

Hopefully you now have an idea of the approaches you can use to gather useful information from the execution of a malware sample without the need for manual intervention. The final post in my series considers anti-analysis techniques and countering sandbox evasion.

Host activity monitoring

20,000 Leagues Under The Sand – Part 2

read part 1

As a newbie sandboxer, the biggest obstacle for me was finding a way of getting in-depth information on what actions were being performed by malware I wanted to test. In particular, I wanted to be able to drop some samples, go away and make lunch, then come back and be looking at some results. That meant stepping through it in a debugger was out, or at least a lesson for another day. You’ve probably already seen that I ended up using Sysmon, but let’s have a look at the alternatives for a moment.

Built in Windows logging

Filesystem forensics

  • The files in C:\Windows\Prefetch\ can show if executables were run
  • The AppCompatCache registry key and AmCache.hve hive contain more detailed information on program execution, though neither logs individual execution instances or command line options
  • You can diff the filesystem – have a clean copy, either of the Master File Table or of the entire structure – and compare to see what’s changed; this is a fairly intensive operation, especially if you intend to see if a known good file has been replaced with a malicious version
  • There are tools for parsing registry hives so identifying new/modified keys is possible

Creating your own API call logging

  • If you’re a good enough programmer to write code that logs API calls, this is the gold standard. I am not (yet) up to this. It is possible to monitor for most of the interesting events such as process and file creation, registry modification etc. using filter drivers. If you want to go a step further and monitor (or even intercept and change) system calls, you need to be looking at DLL injection. This is the method used by Cuckoo sandbox, among many others.

Building monitoring in to the virtualisation

  • Technically this is all just code simulating hardware running other code. If you’re smart enough to modify a hypervisor so that it can recognise and log API calls within its guests, go for it. Please excuse me for thinking you’re a bit mad though!

Options #1, #2 and #4 hold an additional advantage of being difficult or impossible for sandbox evasion techniques to pick up on.

And then we get to Sysmon, which is in effect a version of #3, but it has a big advantage: somebody else did all the work for us! Hooray for Mark Russinovich and Thomas Garnier. Many sandboxes do API call monitoring; sometimes it can be a little bit excessively detailed (hello Cuckoo) but as far as understanding what malware is doing goes, it’s the bee’s knees. Let’s have a look at what you can get out of it.

Sysmon ProcessCreate event output

Sysmon Process Created event

We’ll ignore for now how much my UI leaves to be desired. Here is perhaps the event most commonly of interest to you: Process Created. In this event you have a wealth of data, including not only the location of the executable, the launch command and parent process, but also the MD5 and SHA256 hashes of the file. You can get the import hash here too – though I’d forgotten to turn it on for this run. You can see what ran, from where, by whom, and how it was run, at a glance.

Sysmon File Created event output

Sysmon File Created event

Next up we can log the act of creating a file; in this case a trojan makes a new copy of itself, which is placed in C:\Users\<username>\System\Library\mshost.exe.

Sysmon Registry Value Set event data

Sysmon Registry Value Set event

You can also monitor for interesting things happening in the registry. This is one of the primary methods by which malware achieves “persistence”, i.e. the ability to remain active on the system it infects. Here we can see a new entry being created in one of the user’s Run keys.

Sysmon Network Connection output

Sysmon Network Connection event

In a final example, Sysmon allows you to detect initiation of network connections; not only do we have the network level data of the destination IP and port captured, but the destination hostname is also identified.

In just four event types, Sysmon is able to record the malware starting, hiding itself, achieving persistence, and contacting its Command and Control server. This is the power of logging API calls. But wait – there’s more! This only scratches the surface of what Sysmon can do. It is also capable of identifying:

  • A process changing the creation time of a file
  • Process termination
  • Loading of drivers
  • Loading of additional modules in to existing processes
  • Creation of threads within other running processes
  • Raw access to the disk (as opposed to using the file system APIs)
  • Access to another process’s memory
  • Creation of alternate data streams
  • Use of named pipes (a method of communicating between processes)
  • Use of Windows Management Instrumentation

As you can see, it’s a fantastic tool which would be pretty hard to top if you decided to try doing this yourself. If you are thinking of experimenting with malware – or looking for something to help you keep a closer eye on your systems in general – I can’t recommend it enough.

In part 3 I will discuss the use of IDS and packet capture tools to get detailed information on the malware’s communication.

Collecting Netscaler appflow reverse proxy logs

TL/DR: python script to combine Netscaler reverse proxy HTTP request and response objects and output the result as an Apache-style log. Github repo is here: https://github.com/scherma/ipfixwatch

So the time came where your organisation decided it needs a new and shiny reverse proxy, hopefully before the current bucket of bolts went to the great big datacentre in the sky. It’ll be fun, they said. I told them we needed to talk about the definition of fun. They said they’d schedule a meeting.

This is not the right place (nor am I really qualified) to give an in-depth explanation of appflow; the short version is that it is a binary protocol for logging application-level information in a form that is more flexible than syslog. It has the benefit of having a well-defined structure, which is a plus from a log collection perspective, but being binary means parsing it is tricky and requires specialised tools.

So how can you get the juicy details out of that content? Easier said than done. Citrix will happily sell you an appliance; I leave it to the reader to imagine how many arms and legs the price might be. Ditto Splunk. Then there are the free/OSS options, which is where we arrive at Logstash.

Logstash can receive appflow (or ipfix/netflow) data, parse it, and output information summarising each appflow record. This is great and works (mostly). But when one starts looking at the output, a fly appears in the ointment: requests and responses are logged in separate records. This means that if you’re looking to replace your existing logs like for like, you could have a problem on your hands. Let’s take a look at some of the data. Here is the output in Logstash’s standard json format for an HTTP request:

{
    "@version":"1",
    "host":"10.0.50.4",
    "netflow":{
        "destinationIPv4Address":"10.0.50.5",
        "netscalerHttpReqUserAgent":"Mozilla/5.0",
        "destinationTransportPort":443,
        "netscalerHttpReqUrl":"/some/random/path?with=parameters&other=stuff",
        "sourceIPv4Address":"123.234.123.234",
        "netscalerHttpReqMethod":"GET",
        "netscalerHttpReqHost":"internalhost.unsafehex.com",
        "netscalerHttpReqReferer":"",
        "sourceTransportPort":1337,
        "netscalerHttpDomainName":"netscalerdemo.unsafehex.com",
        "netscalerTransactionId":2459675,
        "netscalerHttpReqXForwardedFor":""
    },
    "@timestamp":"2017-11-08T14:59:58.000Z",
    "type":"ipfix"
}

Plenty of useful information, but as you can see, nothing to indicate what response this request got. This is because the raw appflow packets from the Netscaler output the request and response as separate records, and Logstash is doing a literal translation of each binary record into a separate JSON object. Now let’s have a look at the corresponding response record:

{
    "@version":"1",
    "host":"10.0.50.4",
    "netflow":{
        "destinationIPv4Address":"123.234.123.234",
        "destinationTransportPort":1337,
        "sourceIPv4Address":"10.0.50.5",
        "netscalerHttpRspStatus":200,
        "sourceTransportPort":443,
        "netscalerHttpRspLen":35721,
        "netscalerTransactionId":2459675
    },
    "@timestamp":"2017-11-08T14:59:59.000Z",
    "type":"ipfix"
}

Fortunately for us, although it does not provide the request and response in a single record, it does mark them as belonging to the same transaction, via the netscalerTransactionId field. This means that we can combine the two records and produce the information you would expect in an HTTP log.
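The core of the matching logic is only a handful of lines. A minimal sketch using the field names from the records above (the real script adds the size-limited table and output formatting described below):

import json

pending = {}  # netscalerTransactionId -> first half of the pair seen

def handle(line):
    # feed each line of Logstash's JSON output to this function
    rec = json.loads(line)["netflow"]
    txid = rec["netscalerTransactionId"]
    if txid not in pending:
        pending[txid] = rec
        return
    other = pending.pop(txid)
    req = rec if "netscalerHttpReqUrl" in rec else other
    rsp = other if req is rec else rec
    print('{} "{} {}" {} {}'.format(
        req["sourceIPv4Address"],
        req["netscalerHttpReqMethod"], req["netscalerHttpReqUrl"],
        rsp["netscalerHttpRspStatus"], rsp["netscalerHttpRspLen"]))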

Having discovered this, I was able to throw together a python script which reads the JSON output of Logstash and rebuilds the separate records into a unified message. At the time the Apache Extended format was best suited to my requirements, and so the current version of my script (here) writes this out to hourly log files. Given that the data becomes a python dict, it would be very easy to adapt this to whatever other format you are interested in.

The code’s clearly a hack job so if anyone feels like turning it into good code, I’d welcome the education. In any case, happy parsing!

A few notes:

  • ipfix, and hence appflow, are UDP protocols. There is no guarantee that Logstash will capture the corresponding response for every request, or vice versa. The script maintains a dict of currently-unmatched requests and responses for which there is a size limit to prevent it eating all of your memory. While Logstash is operating normally I have not seen any issues with unpaired request/response but it is theoretically possible.
  • If the script cannot match a request with a response, it will stay in memory until the script is stopped with SIGINT or keyboard interrupt, or the table size limit is reached. At this point, unpaired requests will be written to the output with a response code of 418. Unpaired responses will be discarded.
  • It won’t auto-create the output directory at this point. Sorry, forgot that part, I made them manually and didn’t think of it.

syslog-ng flat file collection: where did my program go?!

Using syslog-ng to forward logs is pretty nice: there’s plenty of documentation, and the configuration is relatively easy to understand compared to other stuff out there (looking at you, rsyslog). But that doesn’t mean everything is completely obvious. If you search for information on how to read a text file log with syslog-ng, you might come up with something like this:

source s_squid3access {
    file("/var/log/squid3/access.log" follow-freq(1)); };

Which checks the file /var/log/squid3/access.log for new entries every second. However, if you simply send this as is, you might end up with a message similar to the one below being sent to your syslog destination (note that I’ve modified my squid instance to log in the Apache Combined log format):

<13>1 2017-11-07T19:07:44+00:00 myproxy 192.168.1.4 - - [meta sequenceId="84"] - - [07/Nov/2017:19:07:43 +0000] "CONNECT www.netflix.com:443 HTTP/1.1" 200 12237 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36" TCP_MISS:HIER_DIRECT

which corresponds to the following line in the log file:

192.168.1.4 - - [07/Nov/2017:19:09:03 +0000] "CONNECT www.netflix.com:443 HTTP/1.1" 200 5101 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36" TCP_MISS:HIER_DIRECT

Note the position of the IP address in the syslog message – it is in the syslog header section in the position of the syslog program field. If you want to collect and parse this information in a SIEM for example, this will cause you quite the headache, as the message it attempts to interpret will begin “- - [07/Nov/2017:19:07:43 +0000]” rather than “192.168.1.4 - - [07/Nov/2017:19:07:43 +0000]”. It will no longer match the format of the data the collector is expecting (probably resulting in the message not getting parsed at all), and even if you tried fixing that, the missing data is a fairly crucial bit of info – there’s not much use knowing that someone visited a site if you can’t find out which user or host it was.

The answer is given in the syslog-ng documentation, although it is not immediately obvious. The section on collecting messages from text files hints that if the message does not have a syslog header, it may behave in an unusual way, but it does not explain in detail what will happen; for that you must look at the options for the file() method. In the description of the flags() option “no-parse” it notes that by default, syslog-ng assumes the first value in a log line is the syslog program. If you set this flag, your originating IP will again be part of the message section, and your SIEM/parsing will be happy again.

You can also set the program_override() option so that the program field is populated, as it is useful in certain SIEM/collection tools to have this info. Now your config file might look a bit like this:

source s_squid3access {
    file("/var/log/squid3/access.log" follow-freq(1) flags(no-parse) program_override("squid3")); };

and all should be well. Happy logging!