Wednesday, 8 April 2015

Extracting Values from JSON Responses in Bash

Disclaimer

My basic premise is that you want to make a throwaway script for exploring or monitoring a JSON REST API and you want to be able to pull the value of a JSON property to feed into the next request, to log or to make a decision about what to do next. In general, if you want to make a maintainable, robust and efficient system which needs to do structured text parsing, you should use a purpose built 3rd party library or framework like jq.

Assumptions

For the purposes of this example, I am making the reasonable assumptions that you have access to a Bash shell, Curl, Python 2.6+ and an up to date version of GNU Grep (OSX users will need to upgrade from the BSD version using Brew).

Extracting JSON Data

An example of an API that returns JSON in the response body is the Way Back When Machine REST API. I'll use a request to get the archive data for example.com in this example:

$ curl http://archive.org/wayback/available?url=example.com 2> /dev/null

Which responds with some archive metadata:

{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/20150119140634/http://me@example.com/","timestamp":"20150119140634","status":"200"}}}

Let's say that it's important to me to have a script make decisions on what to do next depending on the value of "status" returned in the response. It would be hard to use grep to extract this data without formatting the JSON first. This can be achieved using the Python JSON formatter as follows:

$ curl http://archive.org/wayback/available?url=example.com 2> /dev/null | python -m json.tool

{
  "archived_snapshots": {
    "closest": {
      "available": true,
      "status": "200",
      "timestamp": "20150119140634",
      "url": "http://web.archive.org/web/20150119140634/http://me@example.com/"
    }
  }
}

It is then possible to pull out the status itself using a Grep Perl regular expression (-P) and printing out only the matching part of the line (-o). The following regular expression matches any character preceded by "status": " which is not a newline, a comma or a double quote:

(?<="status": ")[^\n,"]*:

Piping it together:

$ curl http://archive.org/wayback/available?url=example.com 2>/dev/null | python -m json.tool | grep -Po '(?<="status": ")[^\n,"]*'

Gives result:

200

If I wanted to extract the value of a numeric or boolean type property I'd have to adjust the expression a little bit by removing the double quote on the end of the initial matcher:

$ curl http://archive.org/wayback/available?url=example.com 2>/dev/null | python -m json.tool | grep -Po '(?<="available": )[^\n,"]*'

Resulting in:

true

To complete the example, the result can be stored directly into a variable for use later in the script:

$ export STATUS=$(curl http://archive.org/wayback/available?url=example.com 2> /dev/null | python -m json.tool | grep -Po '(?<="status": ")[^\n,"]*')

if [[ $STATUS == '200' ]]; then
    echo 'OK'
fi