How do you inject valuable data into your test platforms? How do you provide your data to external stakeholders for investigation? Don’t you have to face personal data issues?

In our missions, we have to handle customer data and they don’t want us to have access to sensitive information or personal information. We don’t want it either.

For instance, we sometimes need to analyze a customer’s logfile that contains sensitive data such as usernames or internal addresses. We thought it would be very useful to replace real, sensitive data with anonymous tokens (with a one-to-one correspondence) without hindering analysis.

This is sometimes useful to comply with the following requirements:

  • Logs are stored for research or training purposes after the log analysis job has been completed, and no data that allows identification of the customer should be present.
  • Privacy laws (e.g. CNIL regulation in France) or regulations prevent the customer from handing over sensitive or personal data.

So we wrote a tool that replaces sensitive fields in a customer’s logs with anonymized values, while generating a lookup table.

A typical process involves the following steps:

  1. This script is typically run by the customer, or by a log analyst as the first step of the log analysis process.
  2. The customer or a log analyst removes log lines relating to internal customer resources – typically, intranet websites.
  3. Analysts then work on the anonymized logs, and come up with lists of anonymized user identifiers, or anonymized IP addresses.
  4. These anonymized tokens may be looked up in the lookup table, to obtain the original value.
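Step 4 can be sketched in a few lines of Python. This is only an illustration, not part of the tool: it assumes the lookup table has the JSON layout shown in the sample further down (field index, then original value, then token), and the helper name `invert_lookup` is made up for this example.

```python
import json

def invert_lookup(table):
    """Build a token -> original value mapping from the lookup table."""
    reverse = {}
    # The table is keyed by field index; each entry maps original -> token.
    for mapping in table.values():
        for original, token in mapping.items():
            reverse[token] = original
    return reverse

# Same content as the sample lookupTable.json shown in this post.
table = json.loads(
    '{"0": {"192.168.1.1": "SRCIP_1"}, "1": {"johnsmith": "USER_1"}}'
)
reverse = invert_lookup(table)
print(reverse["USER_1"])
```

In practice the table would be read from the file with `json.load`, and only someone entitled to see the original values should hold it.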

Here are sample commands that may be used to anonymize logs.

Sample log line:

192.168.1.1 johnsmith - [01/Jan/2000:00:00:00 +0200] "CONNECT tunnel://accounts.google.fr:443/" 200 3154 TCP_MISS:DIRECT 115358 DEFAULT_CASE <IW_srch,5.9,"0","-",0,0,0,"1","-",-,-,-,"-","1",-,"-","-",-,-,IW_srch,-,"Unknown","-","Unknown","Unknown","-","-",0.22,0,-,"-","-">

Create a file named “filters” that contains patterns matching access to internal resources, such as the intranet’s FQDN, one pattern per line. Use the following command to filter these lines out:

grep -vFf filters logfile
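If grep is not available, the same filtering can be approximated in Python. The sketch below is an assumption-laden equivalent, not part of the tool: `filter_lines` is a hypothetical helper, `intranet.example.com` is a placeholder hostname, and unlike grep it ignores empty patterns instead of treating them as matching every line.

```python
def filter_lines(lines, patterns):
    """Keep only the lines that contain none of the fixed-string patterns."""
    # Skip empty patterns so a blank line in the filters file drops nothing.
    patterns = [p for p in patterns if p]
    return [
        line for line in lines
        if not any(p in line for p in patterns)
    ]

# Hypothetical data: one internal access, one external access.
log_lines = [
    '192.168.1.1 johnsmith "GET http://intranet.example.com/" 200',
    '192.168.1.1 johnsmith "CONNECT tunnel://accounts.google.fr:443/" 200',
]
kept = filter_lines(log_lines, ["intranet.example.com"])
print(len(kept))
```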

If you want to anonymize the first two fields (IP address and user name), run anon.py as shown:

./anon.py -f 1=USER,0=SRCIP -t lookupTable.json -F ' ' -i logfile

The script has generated a lookup table in lookupTable.json with the following content:

{"0": {"192.168.1.1": "SRCIP_1"}, "1": {"johnsmith": "USER_1"}}

The script has also generated an anonymized log line as output:

SRCIP_1 USER_1 - [01/Jan/2000:00:00:00 +0200] "CONNECT tunnel://accounts.google.fr:443/" 200 3154 TCP_MISS:DIRECT 115358 DEFAULT_CASE <IW_srch,5.9,"0","-",0,0,0,"1","-",-,-,-,"-","1",-,"-","-",-,-,IW_srch,-,"Unknown","-","Unknown","Unknown","-","-",0.22,0,-,"-","-">
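To make the one-to-one replacement concrete, here is a minimal sketch of how such a substitution could work. It is not the actual code of anon.py: the function `anonymize_line` and its parameters are hypothetical, and the token numbering scheme is inferred from the sample output above.

```python
def anonymize_line(line, fields, table, sep=" "):
    """Replace the selected fields of one log line with stable tokens.

    fields maps a zero-based field index to a token prefix;
    table is updated in place and maps index -> {original: token}.
    """
    parts = line.split(sep)
    for index, prefix in fields.items():
        if index >= len(parts):
            continue  # line is too short to contain this field
        mapping = table.setdefault(str(index), {})
        value = parts[index]
        # Reuse the existing token so the same value always maps to it.
        if value not in mapping:
            mapping[value] = f"{prefix}_{len(mapping) + 1}"
        parts[index] = mapping[value]
    return sep.join(parts)

table = {}
fields = {0: "SRCIP", 1: "USER"}
line = '192.168.1.1 johnsmith - [01/Jan/2000:00:00:00 +0200] "CONNECT tunnel://accounts.google.fr:443/" 200'
print(anonymize_line(line, fields, table))
```

Because the table is shared across calls, a value that appears on several lines keeps the same token, which is what lets analysts correlate events without seeing the original data.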

Find out more in the documentation folder of the repository!

Source, documentation and unit tests can be found here.