Mikhail Sisin Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

Using JSON schema to validate your data


We recently added a couple of neat features that let you work with data more efficiently. One of them is JSON schema support. JSON schema is useful in many cases: for example, to make sure a digger still works properly and the data you are getting is still in good shape, or to keep only specific records and skip the rest. If you are gathering events, for instance, you may want only the events that are not canceled or that still have open slots. If the website exposes this information, you can set rules in a JSON schema to pick only the records you need.
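
As a rough illustration of such a rule, a schema can pin a field to the values you want to keep using the `enum` keyword. The sketch below assumes each record is wrapped in an `item` object with an `is_available` field, as in the test config later in this article. The value "Available" is a hypothetical placeholder, not something taken from a real site, so substitute whatever text the target website actually shows.

```json
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "item": {
            "type": "object",
            "properties": {
                "is_available": {
                    "description": "Hypothetical value; replace with the text the site really uses",
                    "type": "string",
                    "enum": ["Available"]
                }
            },
            "required": ["is_available"]
        }
    },
    "required": ["item"]
}
```

With a schema like this, a record whose `is_available` field is missing or holds any other value fails validation, which is what lets you keep only the records you need.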

So what is JSON schema? As json-schema.org states: “JSON Schema is a vocabulary that allows you to annotate and validate JSON documents.” I recommend learning more about it on that site, since this article does not cover JSON schema syntax and usage. You can pick it up quickly and play with it in debug mode at Diggernaut without paying a dime.

So how do you set a JSON schema for a digger? First, log in to your Diggernaut account, then go to Projects > Diggers, find the digger you need, and click the “Config” button.


This opens the editor panel where you usually enter the digger config. You can see that it now has two additional tabs. Click the “Validator” tab.


Then enter your JSON schema and click the “Save” button.


The next time your digger runs, it applies your JSON schema to validate the data. To understand it better, take a look at the digger config we used for tests:

---
config:
    debug: 2
do:
  - link_add: 'https://diggernaut.com/sandbox/'
  - walk:
      to: links
      do:
        - sleep: 1
        - find:
            path: .result-content
            do:
              - variable_clear: name
              - variable_clear: descr
              - find:
                  path: h3
                  do:
                    - parse
                    - variable_set: name
              - find:
                  path: p
                  do:
                    - parse
                    - variable_set: descr
              - find:
                  path: table
                  do:
                    - find:
                        path: 'tbody > tr'
                        do:
                          - object_new: item
                          - variable_get: name
                          - object_field_set:
                              object: item
                              field: name
                          - variable_get: descr
                          - object_field_set:
                              object: item
                              field: descr
                          - find:
                              path: .col2
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: number
                          - find:
                              path: .col3
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: short_descr
                          - find:
                              path: .col4
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: location
                          - find:
                              path: .col5
                              do:
                                - object_new: date
                                - find:
                                    path: ' .nowrap:nth-child(1)'
                                    do:
                                      - parse
                                      - object_field_set:
                                          object: date
                                          field: start
                                - find:
                                    path: ' .nowrap:nth-child(2)'
                                    do:
                                      - parse
                                      - object_field_set:
                                          object: date
                                          field: end
                                - object_save:
                                    name: date
                                    to: item
                          - find:
                              path: .col6
                              do:
                                - object_new: time
                                - find:
                                    path: ' .nowrap:nth-child(1)'
                                    do:
                                      - parse
                                      - object_field_set:
                                          object: time
                                          field: start
                                - find:
                                    path: ' .nowrap:nth-child(2)'
                                    do:
                                      - parse
                                      - object_field_set:
                                          object: time
                                          field: end
                                - object_save:
                                    name: time
                                    to: item
                          - find:
                              path: .col7
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: days
                          - find:
                              path: .col8
                              do:
                                - parse:
                                    filter:
                                      - "\\s*\\$\\s*(\\d+)\\/"
                                      - "\\s*\\$\\s*(\\d+)"
                                - object_field_set:
                                    object: item
                                    type: int
                                    field: member_fee
                                - parse:
                                    filter:
                                      - "\\s*\\/\\s*\\$\\s*(\\d+)"
                                      - "\\s*\\$\\s*(\\d+)"
                                - object_field_set:
                                    object: item
                                    type: int
                                    field: non_member_fee
                          - find:
                              path: .col9
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: ages
                          - find:
                              path: .col10
                              do:
                                - parse
                                - object_field_set:
                                    object: item
                                    field: is_available
                          - find:
                              path: .ajaxLoad.info-icon.tooltips
                              do:
                                - parse:
                                    attr: href
                                - walk:
                                    to: value
                                    do:
                                      - find:
                                          path: 'tr:nth-of-type(2) td:nth-of-type(2)'
                                          do:
                                            - parse
                                            - object_field_set:
                                                object: item
                                                field: gender
                          - object_save:
                              name: item
        - find:
            path: .next a
            do:
              - parse:
                  attr: href
              - link_add

And the JSON schema we used for it:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Activities",
    "description": "Park district activities",
    "type": "object",
    "properties": {
        "item": {
            "type": "object",
            "properties": {
                "number": {
                    "description": "The unique identifier for an activity",
                    "type": "string"
                },
                "name": {
                    "description": "Activity name",
                    "type": "string"
                },
                "descr": {
                    "description": "Activity description",
                    "type": "string"
                },
                "gender": {
                    "description": "Gender specification for an activity",
                    "type": "string"
                },
                "short_descr": {
                    "description": "Activity short description",
                    "type": "string"
                },
                "ages": {
                    "description": "Allowed ages",
                    "type": "string"
                },
                "days": {
                    "description": "Weekdays when activity takes place",
                    "type": "string"
                },
                "member_fee": {
                    "description": "Fee for members",
                    "type": "number"
                },
                "non_member_fee": {
                    "description": "Fee for non-members",
                    "type": "number"
                },
                "is_available": {
                    "description": "Shows if activity is still available",
                    "type": "string"
                },
                "location": {
                    "description": "Location where activity takes place",
                    "type": "string"
                },
                "dates": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "start": {
                                "description": "Start date for activity session",
                                "type": "string"
                            },
                            "end": {
                                "description": "End date for activity session",
                                "type": "string"
                            }
                        },
                        "required": ["start","end"]
                    },
                    "minItems": 1,
                    "uniqueItems": true
                },
                "time": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "start": {
                                "description": "Start time for activity event",
                                "type": "string"
                            },
                            "end": {
                                "description": "End time for activity event",
                                "type": "string"
                            }
                        },
                        "required": ["start","end"]
                    },
                    "minItems": 1,
                    "uniqueItems": true
                }
            },
            "required": ["number","name","gender"]
        }
    },
    "required": ["item"]
}
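
To make the schema more concrete, here is a sketch of a record that should pass it. Every value is invented for illustration and is not real output from the sandbox site. The record is wrapped in `item` because the schema requires that top-level property, and `time` is shown as an array with a single start/end pair to match the `array` type and `minItems` rule.

```json
{
    "item": {
        "number": "A-101",
        "name": "Beginner Swimming",
        "gender": "Coed",
        "member_fee": 25,
        "non_member_fee": 35,
        "time": [
            {"start": "9:00 AM", "end": "10:00 AM"}
        ]
    }
}
```

Dropping `gender` from this record would fail the `required` rule, and storing `member_fee` as the string "25" instead of a number would fail the `type` rule. Fields that the schema describes but does not list under `required`, such as `location`, can be omitted.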