
Parse logs

How does the parser work?

The parser detects log formats based on a pattern library (a YAML file) and converts log lines into JSON objects (a sketch of the result follows the list below):

  • JSON lines are detected, parsed, and scanned for "@timestamp" and "time" fields (Logstash and Bunyan formats)
  • other lines are matched against the regular expressions in the pattern library
  • matched lines are tagged with the recognized type
  • fields are extracted from the regex match groups
  • if 'autohash' is enabled, sensitive data is replaced with its sha256 hash (or sha512, if configured via 'hashFunction')
  • dates are parsed and the date format is detected (the 'ts' field holds date and time combined)
  • an ISO timestamp is created in the '@timestamp' field
  • the pattern's "transform" function is called to manipulate the parsed object
  • unmatched lines end up with a timestamp and the original line in the 'message' field
  • Logagent includes default patterns for many applications (see below)
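
For example, applying the Apache access-log pattern shown later on this page to a single line could produce an object along these lines. This is a minimal sketch: the sample line and all values are illustrative, geoIP enrichment is omitted, and the metadata fields Logagent attaches may differ by version.

// input line:
// 203.0.113.7 - frank [18/Mar/2023:10:12:05 +0100] "GET /api/users HTTP/1.1" 200 512
// illustrative parsed object:
{
  "@timestamp": "2023-03-18T09:12:05.000Z",
  "client_ip": "203.0.113.7",
  "remote_id": "-",
  "user": "frank",
  "method": "GET",
  "path": "/api/users",
  "http_version": "HTTP/1.1",
  "status_code": "200",
  "size": "512",
  "message": "GET /api/users"
}
// note: extracted fields may remain strings unless typed,
// e.g. [status_code:number] - see the 'fields' property below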

The default pattern definition file comes with patterns for:

  • MongoDB
  • MySQL
  • Nginx
  • Redis
  • Elasticsearch
  • Web servers (Nginx, Apache httpd)
  • Zookeeper
  • Cassandra
  • Kafka
  • HBase HDFS Data Node
  • HBase Region Server
  • Hadoop YARN Node Manager
  • Apache Solr
  • various Linux/Mac OS X system log files

The file format for pattern definitions is based on JS-YAML. In short:

- '-' indicates an array element
- '!!js/regexp' indicates a JS regular expression
- '!!js/function >' indicates a JS function
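
For instance, a minimal, hypothetical pattern entry combines all three constructs (names like 'myapp' are placeholders; the properties themselves are documented below):

patterns:
 - # a minimal, hypothetical pattern group
  sourceName: myapp
  match:
    - regex: !!js/regexp /level=(\w+) msg=(.+)/
      type: myapp_log
      fields: [level,message]
      transform: !!js/function >
        function (p) {
          p.level = p.level.toLowerCase()
        }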

Properties:

  • patterns: list of patterns, each pattern starts with "-"
  • match: group of patterns for a specific log source
  • blockStart: regular expression indicating a new message block for multi-line logs
  • sourceName: regular expression matching the name of the log source (e.g. file or container image name)
  • regex: JS regular expression
  • fields: list of fields extracted from the regex match groups, e.g. [url:string, size:number]
  • type: type used in Logsene (Elasticsearch Mapping)
  • dateFormat: format of the special field 'ts'. If the date format matches, a new '@timestamp' field is generated. The format string must be recognized by the date-fns parse function.
  • transform: JavaScript function to manipulate the result of regex and date parsing

Example

# Sensitive data can be replaced with a hash code (sha256)
# it applies to fields matching the field names by a regular expression
# Note: this function is not optimized (yet) and may reduce throughput by 10-15%
autohash: !!js/regexp /user|password|email|credit_card_number|payment_info/i

# set this to false when using autohash fields,
# because the original line might include sensitive data!
originalLine: false

# activate GeoIP lookup
geoIP: true

# Logagent updates GeoIP DB files automatically
# please note: write access to this directory is required
maxmindDbDir: /tmp/

patterns: 
 - # Apache web logs
  sourceName: httpd
  match: 
    # Common Log Format
    - regex:        !!js/regexp /([0-9a-f.:]+)\s+(-|.+?)\s+(-|.+?)\s+\[([0-9]{2}\/[a-z]{3}\/[0-9]{4}\:[0-9]{2}:[0-9]{2}:[0-9]{2}[^\]]*)\] \"(\S+?)\s(\S*?)\s{0,1}(\S+?)\" ([0-9|\-]+) ([0-9|\-]+)/i
      type: apache_access_common
      fields:       [client_ip,remote_id,user,ts,method,path,http_version,status_code,size]
      dateFormat: DD/MMM/YYYY:HH:mm:ss ZZ
      # lookup geoip info for the field client_ip
      geoIP: client_ip
      # parse only messages that match this regex
      inputFilter: !!js/regexp /api|home|user/
      # ignore messages matching inputDrop
      inputDrop: !!js/regexp /127\.0\.0\.1|\.css|\.js|\.png|\.jpg|\.jpeg/
      # modify parsed object
      transform: !!js/function >
        function (p) {
          p.message = p.method + ' ' + p.path
        }
      customPropertyMinStatusCode: 399
      filter: !!js/function > 
        function (p, pattern) {
          // log only requests with status code > 399
          return p.status_code > pattern.customPropertyMinStatusCode // 399
        }
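
Multi-line logs (e.g. stack traces) are grouped with the 'blockStart' property: lines matching blockStart begin a new message block, and following lines are appended to it. A minimal sketch, assuming entries that start with a date; the source name, regex, and field names are hypothetical:

 - # multi-line logs of a fictional app
  sourceName: !!js/regexp /myapp/
  blockStart: !!js/regexp /^\d{4}-\d{2}-\d{2}/
  match:
    - type: myapp
      regex: !!js/regexp /^([\d\-]+ [\d:]+) (\w+) ([\s\S]+)/
      fields: [ts,severity,message]
      dateFormat: YYYY-MM-DD HH:mm:ss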

JSON input is handled differently: regular expressions are not matched against JSON data. Instead, Logagent parses the JSON and applies the post-processing functions defined in the pattern file. The following example masks fields in JSON and removes fields from the parsed event.

hashFunction: sha512
# post process journald JSON format
# logagent feature to hash fields
# and a custom property 'removeFields', used in the transform function
json: 
  autohashFields: 
    - _HOSTNAME: true
  removeFields: 
    - _SOURCE_REALTIME_TIMESTAMP
    - __MONOTONIC_TIMESTAMP
  transform: !!js/function >
   function (source, parsedObject, config) {
     for (var i=0; i<config.removeFields.length; i++) {
       // console.log('delete ' +config.removeFields[i])
       delete parsedObject[config.removeFields[i]]
     }
   }
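
With this configuration a journald record is transformed roughly as follows (illustrative values; the sha512 hash is truncated):

// input:
// {"_HOSTNAME":"web-1","_SOURCE_REALTIME_TIMESTAMP":"1679130725000000","__MONOTONIC_TIMESTAMP":"4711","MESSAGE":"service started"}
// output: '_HOSTNAME' hashed, both timestamp fields removed by the transform:
// {"_HOSTNAME":"a3b1c2...","MESSAGE":"service started"}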

The default patterns are available in the pattern definition file shipped with Logagent. To add more patterns, please submit a Pull Request.

Node.js API for the parser

Install Logagent as a local module and save the dependency to your package.json

npm i @sematext/logagent --save

Use the Logparser module in your source code

var Logparser = require('@sematext/logagent')
var lp = new Logparser('./patterns.yml')
lp.parseLine('log message', 'source name', function (err, data) {
    if (err) {
      console.log('line did not match any pattern')
    }
    console.log(JSON.stringify(data))
})
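
Note that, per the unmatched-line handling described above, 'data' is still populated when no pattern matches: it should contain an '@timestamp' and the original line in the 'message' field, which is why the snippet logs 'data' in both cases.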

How to test log patterns?

Use the command line tool 'logagent' to test patterns or convert logs from text to JSON. It reads from stdin and outputs line-delimited JSON (or pretty JSON or YAML) to the console. In addition, it can forward the parsed objects directly to Sematext or Elasticsearch.

Test your patterns ('-y' prints YAML output, '-n' sets the log source name, '-f' loads your pattern definitions):

cat myapp.log | bin/logagent -y -n myapp -f mypatterns.yml
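
To check a single line quickly, the same flags work with echo (the sample line and source name are illustrative):

echo '203.0.113.7 - frank [18/Mar/2023:10:12:05 +0100] "GET /api/users HTTP/1.1" 200 512' | bin/logagent -y -n httpd -f mypatterns.yml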