ElasticSearch + Python

This post is dedicated to using ElasticSearch in Python. If you already know Python and some theory about both SQL and NoSQL databases, this article will be a great fit for you.

This article will essentially chronicle my adventures with ElasticSearch while working on a new project, which I hope to unveil soon.

This article assumes that you will be using Docker to deploy ElasticSearch.

What is ElasticSearch?

ElasticSearch is a scalable database/search system built on top of Apache Lucene. Its usefulness as a database is limited, so you shouldn't use it as your source-of-truth database; instead, reload data from your primary store into ElasticSearch, either continuously or at predictable intervals. ElasticSearch can then be used to search through that data.

The deployment

First off, we need to deploy ElasticSearch. To do that, we will run a container called elasticsearch, expose ports 9200 and 9300, and mount a volume at /usr/share/elasticsearch/data, which is where ElasticSearch stores its data. Next, we need to define the following two environment variables:

  • discovery.type=single-node
  • ES_JAVA_OPTS=-Xms1g -Xmx1g

The first tells ElasticSearch that it's not part of a cluster, and the second caps the JVM heap, overriding ElasticSearch's default behaviour of allocating a large portion of system memory. On my 40 GB server it tried to allocate 23 GB and, of course, failed miserably.
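Putting the above together, the deployment might look like this (the host data path and the image tag are examples – adjust both to your setup):

```shell
# Run ElasticSearch as a single-node container with a 1 GB heap.
docker run -d \
  --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -v /opt/elasticsearch/data:/usr/share/elasticsearch/data \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.9.2
```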

Now that we've got that done, let's analyze ElasticSearch a bit.

Theory

ElasticSearch is a database/search engine built on top of Lucene. It stores documents formatted as JSON, so it's a NoSQL database.

You store mostly schema-free documents (corresponding to rows in an SQL table) in indices (think tables/databases in SQL, but more like one giant table). Documents in the same index may have different fields (i.e. follow different schemas), best described by the index's mapping. ElasticSearch historically also had mapping types, but since all types in an index were stored physically in a single table (the index), they never worked like separate tables for your data, and they are being removed as we speak. Data put together in a single index will be searchable together. Indices are identified by their lowercase names. A mapping is essentially a hashtable of field name to field data type.

Indices might be split horizontally across multiple shards (think your run-of-the-mill SQL sharding). They might also have multiple replicas, but beware! An index holding two copies of the data will have number_of_replicas set to 1, because ElasticSearch doesn't count the primary copy as a replica at all.
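To make that counting concrete, here's a hypothetical settings fragment asking for two physical copies of every shard:

```python
# One primary + one replica per shard = two physical copies of the data,
# yet number_of_replicas is 1, because the primary isn't counted as a replica.
two_copies = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1
        }
    }
}

copies_per_shard = two_copies["settings"]["index"]["number_of_replicas"] + 1
```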

A single shard of a single replica is backed by a single Lucene index, which largely determines what ElasticSearch can and cannot do.

Python

First things first, we have to install the ElasticSearch Python client. The client is basically a thin wrapper over urllib3, as the only way to interface with ElasticSearch is through an HTTP API, so you will see some pretty threatening log entries during heavy usage. Don't be afraid – everything is in order.

But for now, issue:

pip install elasticsearch

And open your Python shell. The following code will connect to ElasticSearch and verify that the connection was made:

import elasticsearch
es = elasticsearch.Elasticsearch(['your_elasticsearch_address'])
if not es.ping():
    raise RuntimeError('Failed to connect to ElasticSearch')

Then, we have to create an index. Let's assume that we'll be storing data about companies, in an index called vat_data:

settings = {
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        }
    },
    "mappings": {
        "properties": {
            "vat_no": {"type": "keyword"},
            "name": {"type": "text"},
            "address": {"type": "text"},
            "postcode": {"type": "keyword"},
            "country": {"type": "text"}
        }
    }
}
es.indices.create('vat_data', body=settings)

text is a field type meant for full-text searches (ElasticSearch does some fun stuff with this data, such as tokenizing and analyzing it). Note that some fields have been defined as keyword – that's necessary for returning exact matches. You can of course duplicate your data by creating, say, a field called kw_name of type keyword, to which you assign the same value as name.
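Instead of duplicating the value by hand, ElasticSearch can index one field both ways via multi-fields. A sketch of what the name mapping from above could look like (the sub-field name kw is my choice):

```python
# Index `name` as full-text, and additionally as an exact-match keyword
# reachable in queries under `name.kw` – no manual duplication needed.
name_mapping = {
    "name": {
        "type": "text",
        "fields": {
            "kw": {"type": "keyword"}
        }
    }
}
```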

Beware! If the index already exists, an elasticsearch.exceptions.RequestError will be raised. To guard against that, check whether the index exists before creating it:

if not es.indices.exists('vat_data'):
    es.indices.create('vat_data', body=settings)

Now let’s put some data there:

record = {
    'vat_no': '000000000',
    'name': 'KOWALSKI LIMITED',
    'address': '1 PICCADILLY CIRCUS',
    'country': 'GB',
    'postcode': 'W1J 9LL'
}
es.index('vat_data', id='000000000', body=record)

id, better known as _id, is a special kind of field. It is added to each document and serves as a primary key for its later retrieval, update and deletion.

Warning: data is not immediately searchable after insertion – new documents only become visible after the next index refresh, which by default happens every second. In your unit tests, don't sleep and hope; force a refresh after writing instead.
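A small helper sketch for tests (assuming an es client as above): index the document, then force a refresh so it is searchable immediately.

```python
def index_and_refresh(es, index: str, doc_id: str, doc: dict) -> None:
    """Index a document, then refresh the index so the doc is
    immediately visible to searches (useful in unit tests)."""
    es.index(index, id=doc_id, body=doc)
    es.indices.refresh(index=index)
```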

Searching

In order to search ElasticSearch, we need to build a query. Let's ask whether a given value appears in any of our fields:

import typing as tp

def ask(query: str) -> tp.Iterator[dict]:
    hits = es.search(index='vat_data', body={
        'query': {
            'multi_match': {
                'query': query,
                'fields': ['vat_no', 'address', 'name', 'postcode']
            }
        }
    }, request_timeout=30)

    for hit in hits['hits']['hits']:
        yield hit['_source']

Notice how we used a multi-match query. This is a way to ask for a match on any of several fields. You could, for example, now execute ask('LONDON') to list all companies in LONDON. In order to match only a single field, we would use a match query:

def ask_address(query: str) -> tp.Iterator[dict]:
    hits = es.search(index='vat_data', body={
        'query': {
            'match': {
                'address': {
                    'query': query
                }
            }
        }
    }, request_timeout=30)

    for hit in hits['hits']['hits']:
        yield hit['_source']

There are multiple types of queries to choose from, such as query_string, which lets you pass Lucene query syntax directly.
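A sketch of what such a query_string body could look like against the vat_data index (the Lucene expression below is illustrative):

```python
# query_string parses Lucene syntax: field:value pairs, AND/OR operators,
# wildcards and so on, all inside one string.
lucene_query = {
    "query": {
        "query_string": {
            "query": "address:LONDON AND name:LI*"
        }
    }
}
# es.search(index='vat_data', body=lucene_query)
```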

If you need a single record to match multiple fields with different options, try this:

es.search(index='vat_data', body={
    "query": {
        "bool": {
            "must": [
                {"match": {"address": "DRIVE"}},
                {"match": {"name": "LINCOLN"}}
            ]
        }
    }
})

Note however that this will match only entire words, so for example it won’t match LINCOLNSHIRE. To perform a wildcard match on one field do:

es.search(index='vat_data', body={
    "query": {
        "bool": {
            "must": [
                {"wildcard": {"address": {"value": "*ST*"}}},
                {"wildcard": {"name": {"value": "LI*"}}}
            ]
        }
    }
})

Just take care with starting your patterns with a *. Leading wildcards cannot make use of the index, so they will have poor performance.
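When a pattern only has a trailing *, a prefix query expresses the same match without the wildcard machinery; a sketch:

```python
# Matches names starting with "LI" – equivalent to the "LI*" wildcard above,
# with no risk of an expensive leading-* scan.
prefix_query = {
    "query": {
        "prefix": {
            "name": {"value": "LI"}
        }
    }
}
# es.search(index='vat_data', body=prefix_query)
```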

If you want to have an exact match of a field called vat_no, the following will suffice:

es.search(index='vat_data', body={
    "query": {
        "term": {
            "vat_no": {
                "value": "023432543"
            }
        }
    }
})

Be warned! You can only do exact matches on fields that you've defined as keyword – on analyzed text fields, term queries compare against the stored tokens, not your raw value.
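If you need exact matches against several candidate values at once, the plural terms query works on the same keyword field; a sketch:

```python
# Matches documents whose vat_no is exactly any of the listed values.
terms_query = {
    "query": {
        "terms": {
            "vat_no": ["023432543", "000000000"]
        }
    }
}
# es.search(index='vat_data', body=terms_query)
```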

That’s all folks! Hope you found that enjoyable.
