Parse logs
How does the parser work?
The parser detects log formats based on a pattern library (a YAML file) and converts each log line to a JSON object:
- JSON lines are detected, parsed, and scanned for "@timestamp" and "time" fields (Logstash and Bunyan formats)
- find a matching regex in the pattern library
- tag the line with the recognized type
- extract fields using the regex
- if 'autohash' is enabled, replace sensitive data with its sha256 hash (or sha512, selectable by configuration)
- parse dates and detect the date format (the special 'ts' field holds date and time combined)
- create an ISO timestamp in the '@timestamp' field
- call the pattern's "transform" function to manipulate the parsed object
- unmatched lines end up with a timestamp and the original line in the message field
- Logagent includes default patterns for many applications (see below)
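For illustration, here is roughly what these steps produce for a web server access-log line, using the Apache access-log pattern from the example further below (output rendered as YAML; the exact fields and value types depend on the pattern definition):

```yaml
# input line:
#   192.168.1.20 - alice [10/Oct/2000:13:55:36 -0700] "GET /home HTTP/1.0" 200 2326
# parsed object (approximate):
client_ip: 192.168.1.20
remote_id: '-'
user: alice
ts: 10/Oct/2000:13:55:36 -0700
method: GET
path: /home
http_version: HTTP/1.0
status_code: 200
size: 2326
'@timestamp': 2000-10-10T20:55:36.000Z    # ISO timestamp created from 'ts'
message: GET /home                        # set by the pattern's transform function
_type: apache_access_common               # the recognized type
```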
The default pattern definition file comes with patterns for:
- MongoDB
- MySQL
- Nginx
- Redis
- Elasticsearch
- Webserver (Nginx, Apache Httpd)
- Zookeeper
- Cassandra
- Kafka
- HBase
- HDFS Data Node
- HBase Region Server
- Hadoop YARN Node Manager
- Apache Solr
- various Linux/Mac OS X system log files
The file format for pattern definitions is based on JS-YAML. In short:
- `-` indicates an array
- `!!js/regexp` indicates a JS regular expression
- `!!js/function >` indicates a JS function
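As a minimal sketch, these constructs combine as follows (a hypothetical 'myapp' pattern, not part of the default library):

```yaml
patterns:
  # each '-' starts a new array element
  - sourceName: !!js/regexp /myapp/       # !!js/regexp defines a JS regular expression
    match:
      - regex: !!js/regexp /^(\w+) (.*)/
        type: myapp
        fields: [level, message]
        transform: !!js/function >        # '!!js/function >' defines a JS function
          function (p) {
            p.level = p.level.toLowerCase()
          }
```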
Properties:
- patterns: list of patterns, each pattern starts with "-"
- match: group of patterns for a specific log source
- blockStart: regular expression indicating a new message block for multi-line logs
- sourceName: regular expression matching the name of the log source (e.g. file or container image name)
- regex: JS regular expression
- fields: list of the fields extracted by the regex match groups, e.g. [url:string, size:number]
- type: type used in Logsene (Elasticsearch Mapping)
- dateFormat: format of the special field 'ts'. If the date format matches, a new field @timestamp is generated. The format string needs to be recognized by the date-fns parse function.
- transform: JavaScript function to manipulate the result of regex and date parsing
Example
```yaml
# Sensitive data can be replaced with a hashcode (sha256)
# it applies to fields matching the field names by a regular expression
# Note: this function is not optimized (yet) and might take 10-15% of performance
autohash: !!js/regexp /user|password|email|credit_card_number|payment_info/i
# set this to false when using autohash fields,
# as the original line might include sensitive data!
originalLine: false
# activate GeoIP lookup
geoIP: true
# Logagent updates GeoIP DB files automatically
# pls. note write access to this directory is required
maxmindDbDir: /tmp/
patterns:
  # APACHE Web Logs
  - sourceName: httpd
    match:
      # Common Log Format
      - regex: !!js/regexp /([0-9a-f.:]+)\s+(-|.+?)\s+(-|.+?)\s+\[([0-9]{2}\/[a-z]{3}\/[0-9]{4}\:[0-9]{2}:[0-9]{2}:[0-9]{2}[^\]]*)\] \"(\S+?)\s(\S*?)\s{0,1}(\S+?)\" ([0-9|\-]+) ([0-9|\-]+)/i
        type: apache_access_common
        fields: [client_ip,remote_id,user,ts,method,path,http_version,status_code,size]
        dateFormat: DD/MMM/YYYY:HH:mm:ss ZZ
        # lookup geoip info for the field client_ip
        geoIP: client_ip
        # parse only messages that match this regex
        inputFilter: !!js/regexp /api|home|user/
        # ignore messages matching inputDrop
        inputDrop: !!js/regexp /127.0.0.1|\.css|\.js|\.png|\.jpg|\.jpeg/
        # modify the parsed object
        transform: !!js/function >
          function (p) {
            p.message = p.method + ' ' + p.path
          }
        # custom property, used in the filter function below
        customPropertyMinStatusCode: 399
        filter: !!js/function >
          function (p, pattern) {
            // log only requests with status code > 399
            return p.status_code > pattern.customPropertyMinStatusCode // 399
          }
```
The handling of JSON is different: regular expressions are not matched against JSON data. Instead, Logagent parses the JSON and provides post-processing functions in the pattern definition. The following example masks fields in JSON input and removes fields from the parsed event.
```yaml
hashFunction: sha512
# post-process the journald JSON format using the
# Logagent feature to hash fields
# and a custom property 'removeFields', used in the transform function
json:
  autohashFields:
    - _HOSTNAME: true
  removeFields:
    - _SOURCE_REALTIME_TIMESTAMP
    - __MONOTONIC_TIMESTAMP
  transform: !!js/function >
    function (source, parsedObject, config) {
      for (var i = 0; i < config.removeFields.length; i++) {
        // console.log('delete ' + config.removeFields[i])
        delete parsedObject[config.removeFields[i]]
      }
    }
```
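Applied to a hypothetical journald event, this configuration hashes _HOSTNAME with sha512 and drops the two timestamp fields, roughly:

```yaml
# input event:
#   {"_HOSTNAME": "web-1", "MESSAGE": "Started nginx.service",
#    "_SOURCE_REALTIME_TIMESTAMP": "1519913400000000", "__MONOTONIC_TIMESTAMP": "43521000000"}
# resulting event (approximate):
_HOSTNAME: 8a1f3b...      # sha512 hash of 'web-1', shortened to a placeholder here
MESSAGE: Started nginx.service
# _SOURCE_REALTIME_TIMESTAMP and __MONOTONIC_TIMESTAMP were removed by the transform
```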
The default patterns ship with Logagent (see patterns.yml in the Logagent GitHub repository). To add more patterns, please submit a pull request.
Node.js API for the parser
Install Logagent as a local module and save the dependency to your package.json:

```bash
npm i @sematext/logagent --save
```
Use the Logparser module in your source code:

```js
var Logparser = require('@sematext/logagent')
var lp = new Logparser('./patterns.yml')
lp.parseLine('log message', 'source name', function (err, data) {
  if (err) {
    console.log('line did not match any pattern')
  }
  console.log(JSON.stringify(data))
})
```
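If no pattern matches, the callback still receives an object with the original line in the message field and a timestamp, as described in the parser steps above.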
How to test log patterns?
Use the command-line tool 'logagent' to test patterns or to convert logs from text to JSON. It reads from stdin and outputs line-delimited JSON (or pretty JSON, or YAML) to the console. In addition, it can forward the parsed objects directly to Sematext or Elasticsearch.
Test your patterns:
```bash
cat myapp.log | bin/logagent -y -n myapp -f mypatterns.yml
```
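For a quick smoke test without a log file, you can pipe a single line to Logagent; here the source name httpd selects the Apache pattern from the example above, so the output should resemble the YAML object shown at the top of this page:

```bash
echo '192.168.1.20 - alice [10/Oct/2000:13:55:36 -0700] "GET /home HTTP/1.0" 200 2326' \
  | bin/logagent -y -n httpd
```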