Geocoding: how to bind 250 thousand addresses to coordinates in 10 minutes?



Hello, Habr!

In this article I would like to share my experience of solving a small problem with a large number of addresses. If you have ever worked with a geocoding API or used online geocoding tools, I think you share my pain of waiting hours, or even longer, for the results.

This is not about complex optimization algorithms, but about using a batch geocoding service that takes a list of addresses as input and returns a file with the results. This can reduce processing time from several hours to minutes.

First things first:


Background


A task arrived: "Bind 24 thousand addresses to coordinates." Only two solutions came to mind:

  1. Use the web-based geocoding application that we used at the university;
  2. Write a script based on a geocoder's REST API.

In the first case, it turned out that the web application crashes after processing a few thousand addresses. The idea of splitting the dataset among colleagues was abandoned immediately.

So you need to write your own script on top of a geocoder's REST API and save the results (this is not an entirely legal approach, and you should read the service's terms of use). A new problem arises: it is one thing to search for an address in an application and get the result immediately, but when the task is to process more than ten thousand addresses and store the results, the script takes a very long time. You can wait an hour or two, but a million addresses would take "forever" to geocode, so you need to look for another solution, and there is one!
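For illustration, such a self-written script usually boils down to something like the sketch below: one HTTP request per address, so even a fast geocoder gives roughly one address per second. The endpoint URL, parameter names, and response format here are placeholders rather than any particular provider's API:

import csv
import requests

GEOCODER_URL = "https://geocoder.example.com/geocode"  # placeholder, not a real endpoint
API_KEY = "YOUR_API_KEY"

def geocode_one_by_one(in_path, out_path):
    """The slow way: one HTTP request per address, written out as results arrive."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=";")
        writer.writerow(["address", "lat", "lon"])
        for address in src:
            address = address.strip()
            if not address:
                continue
            # Each address costs a full network round trip to the geocoder
            response = requests.get(GEOCODER_URL,
                                    params={"searchtext": address, "apiKey": API_KEY})
            result = response.json()  # the response format depends on the provider
            writer.writerow([address, result.get("lat"), result.get("lon")])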

In addition to the usual geocoding service, large geolocation providers offer a batch geocoder, designed precisely for processing a large number of addresses in a single request.

Batch Geocoding


The name of the service speaks for itself: we have a batch (for example, a CSV file with a list of addresses in table form), we upload it to the server, and the server does all the work for us.

The process looks like this:

  1. Prepare the dataset so that the service accepts it without errors;
  2. Set the parameters for the output (choice of columns, delimiter, ...);
  3. Upload the file to the cloud;
  4. Wait for processing to complete;
  5. Download the finished file.

Thanks to cloud computing power, what a self-written script does in an hour is completed in a minute.

The next step is to choose the company with the most generous terms of use for its batch geocoder. First, not everyone offers such a service, and some only let you test it with serious limitations. Also, if your volumes are very large, you need to pay attention to the cost of additional transactions in case you exceed the free plan's limit.

Choosing a Batch Geocoding Service Provider


In the global market for geolocation services, the leading positions are occupied by:

  • Google Maps;
  • HERE Technologies;
  • Mapbox;
  • TomTom;
  • ESRI.

Of course, you should not forget about Yandex Technologies, which has a fairly strong position in Russia.

I took the following parameters as a basis for choosing a provider:

  • The number of free geocoding requests per month;
  • The limit on the number of transactions per day;
  • Availability of a batch geocoding service;
  • Ability to use the batch geocoder on a free plan.

Each company has its own monetization model. Depending on the project, a particular model can work in your favor or, on the contrary, be a significant limitation.

Google Maps


To get started with Google's geo services, the first thing you need to do is add credit card details to your account. There is a monthly limit of 200 virtual dollars; beyond that, additional transactions are charged to the linked card. Within this limit you can use various services, but each one is billed differently. For example, a thousand geocoding requests cost $5, while the routing service is twice as expensive. More details can be found on the website; here we are only interested in the geocoding service.

With $200 per month, it is easy to calculate the number of free transactions: 40,000 (for the geocoding service). There is no batch geocoder among the services. This means you have to write your own script, and the result will be about one address per second, which is roughly six hours for 24 thousand addresses. To speed up the process you could try running the script on the Google Cloud platform, but I decided to look for alternative solutions. There is no limit on the number of transactions per day, so all forty thousand can be spent at once.

HERE Technologies


Formerly Nokia Maps, and before that Navteq, HERE provides 250,000 transactions per month for free. As with Google Maps, this number applies to all services, and each is billed differently. When using the free plan, you do not need to attach a bank card. If you exceed the limit, each additional thousand transactions costs $1.

Importantly, there is a batch geocoder as a separate service, and it is included in the free plan. Its transactions are counted using the same model as the regular geocoder's: the batch geocoder treats each address in the file as one transaction.
As the title of the article suggests, I used the HERE batch geocoder, since you can spend all your transactions on geocoding and perform 250,000 geocoding operations per month. But this is not the only option, so let's look at what other companies offer.
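To get a feel for these numbers, here is a quick back-of-the-envelope estimate based on the figures above (250,000 free transactions per month and $1 per additional thousand; billing per started thousand is my assumption):

def monthly_cost(addresses, free_limit=250_000, price_per_1000=1.0):
    """Rough monthly cost of geocoding `addresses` records on the free HERE plan."""
    overage = max(0, addresses - free_limit)
    return -(-overage // 1000) * price_per_1000  # ceiling division: per started thousand

print(monthly_cost(250_000))    # 0.0   - fits entirely into the free tier
print(monthly_cost(1_000_000))  # 750.0 - 750 extra thousands at $1 each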

Mapbox


The Mapbox geocoder offers 100 thousand transactions per month. The company follows the same monetization model, with payment for additional transactions. But there is an interesting option for "wholesalers": the more transactions you use, the less they cost (with a lower bound on the price, of course). For example, from 100 thousand to 500 thousand, an additional thousand requests costs $0.75; from 500 thousand to 1 million, $0.60; and so on, with more details on the website. Unfortunately, the batch geocoder is only available on a paid account.

TomTom


The platform allows 2,500 transactions per day, or roughly 75,000 per month. For testing and development the daily limit does not look very attractive compared to competitors, but the pricing for additional transactions is the most flexible: there are 8 pricing tiers for an extra thousand requests, with the price dropping from $0.50 to $0.42.

Among the services there is a batch geocoder that can process up to 10 thousand addresses per request (although the daily limit still has to be taken into account).
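With such a cap on the request size, a large address file would have to be split into chunks of 10 thousand rows first. A minimal sketch of that preparation step (plain file handling with the standard csv module, no TomTom-specific calls):

import csv

def split_into_chunks(in_path, chunk_size=10_000, delimiter=";"):
    """Split a large address file into chunks a batch service will accept."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src, delimiter=delimiter)
        header = next(reader)
        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                _write_chunk(in_path, part, header, chunk, delimiter)
                chunk, part = [], part + 1
        if chunk:
            _write_chunk(in_path, part, header, chunk, delimiter)

def _write_chunk(in_path, part, header, rows, delimiter):
    out_path = "{}_part{}.csv".format(in_path.rsplit(".", 1)[0], part)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(rows)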

Yandex Technologies


Yandex also uses a model with a daily transaction limit, but a more generous one: 25 thousand requests. If you multiply this number by the number of days in a month, you get an impressive figure of 750 thousand. The site lists prices for an additional thousand transactions in rubles, ranging from 120 rubles down to 11 rubles.

A batch geocoder is not offered as a service, so no real optimization can be achieved here.

ESRI


A very tempting free plan with 1 million transactions per month. The company also credits 50 credits to each account (roughly equivalent to $5). It is worth noting that this is the most generous plan for geolocation services. There is also a batch geocoding service, but you can only use it if you have a corporate account on the ArcGIS Online platform.

What to choose in the end?


The easiest way to decide is to put together a small table:

  Provider     | Free limit                   | Limit type | Batch geocoder
  Google Maps  | 40,000 (via the $200 credit) | monthly    | no
  HERE         | 250,000                      | monthly    | yes, in the free plan
  Mapbox       | 100,000                      | monthly    | yes, paid plans only
  TomTom       | 2,500 (~75,000 per month)    | daily      | yes, up to 10,000 addresses per request
  Yandex       | 25,000 (~750,000 per month)  | daily      | no
  ESRI         | 1,000,000                    | monthly    | yes, corporate ArcGIS Online only

In the end, my choice fell on HERE, since it is the best option for my problem. Of course, this is far from a complete analysis; ideally you would run your dataset through all of the geocoders to assess quality. And if you have several million addresses, you should consider a paid plan, in which case you need to take the cost of additional transactions into account.

The purpose of this article is not to compare companies but to solve the problem of speeding up the geocoding of a large volume of addresses; I have simply shown my reasoning when choosing a service provider.

Working with the Service from Python


First, you need to create an account on the developer portal and generate a REST API key in the projects section.

Now you can work with the platform. I will describe only part of the HERE batch geocoder's functionality: uploading data, checking the status, and saving the results.

So, let's start by importing the necessary libraries:

import requests
import json
import time
import zipfile
import io
from bs4 import BeautifulSoup
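Of these, only requests and beautifulsoup4 are third-party packages (plus lxml, which BeautifulSoup uses as the parser below); if they are not installed yet, pip install requests beautifulsoup4 lxml takes care of it. The json, time, zipfile and io modules come with the standard library.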

Next, if the imports went through without errors, we create a class:

class Batch:

    SERVICE_URL = "https://batch.geocoder.ls.hereapi.com/6.2/jobs"
    jobId = None

    def __init__(self, apikey="your_api_key"):
        self.apikey = apikey

That is, when initializing the class, you must pass your own REST API key.
The SERVICE_URL variable is the base URL for the batch geocoding service.
And jobId will store the identifier of the current geocoding job.

An important point is getting the data structure of the request right. The file must contain two required columns: recId and searchText. Otherwise the service will respond with an upload error.
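If your source data is just a plain list of addresses, a file of this shape is easy to produce with the standard csv module. A small sketch (the column names recId and searchText are the ones the service requires; everything else, including the output file name, is illustrative), producing a file like the example dataset shown next:

import csv

def make_batch_file(addresses, out_path="big_data_addresses.csv", delimiter=";"):
    """Write a list of address strings into the recId;searchText format."""
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(["recId", "searchText"])
        for rec_id, address in enumerate(addresses, start=1):
            writer.writerow([rec_id, address])

make_batch_file([
    "425 W Randolph St Chicago IL 60606",
    "200 S Mathilda Ave Sunnyvale CA 94086",
])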

Here is an example dataset:

   recId; searchText
   1; 425 W Randolph St Chicago IL 60606
   2; 200 S Mathilda Ave Sunnyvale CA 94086


Function for uploading a file to the cloud:

    def start(self, filename, indelim=";", outdelim=";"):
        
        file = open(filename, 'rb')

        params = {
            "action": "run",
            "apiKey": self.apikey,
            "politicalview":"RUS",
            "gen": 9,
            "maxresults": "1",
            "header": "true",
            "indelim": indelim,
            "outdelim": outdelim,
            "outcols": "displayLatitude,displayLongitude,locationLabel,houseNumber,street,district,city,postalCode,county,state,country",
            "outputcombined": "true",
        }

        response = requests.post(self.SERVICE_URL, params=params, data=file)
        self.__stats (response)
        file.close()


Everything is quite simple: we open the file with the list of addresses for reading and build a dictionary of request parameters. Some parameters are worth explaining:

  • "action": "run" - start processing the uploaded addresses;
  • "politicalview": "RUS" - the political view, i.e. how disputed territories are represented (an optional parameter);
  • "gen": 9 - the generation (version) of the underlying geocoder;
  • "maxresults": 1 - the maximum number of results returned per address;
  • "header": "true" - the input file contains a header row;
  • "indelim": ";" - the delimiter used in the input file;
  • "outdelim": ";" - the delimiter used in the output file;
  • "outcols": "..." - the list of columns to include in the output;
  • "outputcombined": "true" - return the results combined into a single output file.

Next, we simply send the request with the requests library and print the statistics. Of course, the file needs to be closed at the end of the function. The __stats method parses the server response, which contains the Id of the launched job, and also prints general information about the operation.

The next step is to check the status of the job. The request is formed in a similar way, except that the job Id has to be passed in the URL and the action parameter must contain the value "status". The __stats method prints full statistics to the console, which makes it possible to estimate when the geocoder will finish.

    def status (self, jobId = None):

        if jobId is not None:
            self.jobId = jobId
        
        statusUrl = self.SERVICE_URL + "/" + self.jobId
        
        params = {
            "action": "status",
            "apiKey": self.apikey,
        }
        
        response = requests.get(statusUrl, params=params)
        self.__stats (response)
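Since the job runs asynchronously, in practice it is convenient to wrap this status check in a polling loop and simply wait until the work is finished. A small helper of this kind might look like the sketch below; it is not part of the original class, reuses the requests, BeautifulSoup and time imports from the top of the script, and assumes that any status other than "accepted" and "running" means the job has reached a final state:

def wait_until_done(service, poll_seconds=30):
    """Poll the batch job status until it leaves the queue (sketch)."""
    status_url = service.SERVICE_URL + "/" + service.jobId
    params = {"action": "status", "apiKey": service.apikey}
    while True:
        response = requests.get(status_url, params=params)
        status = BeautifulSoup(response.text, "lxml").find("status").get_text()
        print("Current status:", status)
        if status not in ("accepted", "running"):
            return status
        time.sleep(poll_seconds)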

One of the most important functions saves the result. For convenience, it is better to unzip the file that comes from the server right away. The request for the results is identical to the status check; you just add /result to the end of the URL.

    def result (self, jobId = None):

        if jobId is not None:
            self.jobId = jobId
        
        print("Requesting result data ...")
        
        resultUrl = self.SERVICE_URL + "/" + self.jobId + "/result"
        
        params = {
            "apiKey": self.apikey
        }
        
        response = requests.get(resultUrl, params=params, stream=True)
        
        if (response.ok):    
            zipResult = zipfile.ZipFile(io.BytesIO(response.content))
            zipResult.extractall()
            print("File saved successfully")
        
        else:
            print("Error")
            print(response.text)
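The archive is unpacked into the current directory. The name of the extracted file depends on the job, so to work with the results you simply point at whatever text file appeared. A minimal sketch of reading it back (the delimiter has to match the outdelim used when the job was started, and the column names come from the outcols list above):

import csv

def load_results(result_path, delimiter=";"):
    """Read the geocoded rows back from the extracted result file (sketch)."""
    with open(result_path, newline="", encoding="utf-8") as src:
        return list(csv.DictReader(src, delimiter=delimiter))

# rows = load_results("<name of the extracted file>")
# print(rows[0]["displayLatitude"], rows[0]["displayLongitude"])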

The final method parses the service's response. It is also responsible for saving the identifier of the current geocoding job:

    def __stats (self, response):
        if (response.ok):
            parsedXMLResponse = BeautifulSoup(response.text, "lxml")

            self.jobId = parsedXMLResponse.find('requestid').get_text()
            
            for stat in parsedXMLResponse.find('response').findChildren():
                if(len(stat.findChildren()) == 0):
                    print("{name}: {data}".format(name=stat.name, data=stat.get_text()))

        else:
            print(response.text)

To test it, just run the Python interpreter in the folder with the script. The Batch class lives in the geocoder.py file:

>>> from geocoder import Batch
>>> service = Batch(apikey="your_api_key")
>>> service.start("big_data_addresses.csv", indelim=";", outdelim=";")

requestid: "<your job Id>"
status: accepted
totalcount: 0
validcount: 0
invalidcount: 0
processedcount: 0
pendingcount: 0
successcount: 0
errorcount: 0


Great, the job has started. Let's check its status:

>>> service.status()

requestid: "<your job Id>"
status: completed
jobstarted: 2020-04-27T10:09:58.000Z
jobfinished: 2020-04-27T10:17:18.000Z
totalcount: 249999
validcount: 249999
invalidcount: 0
processedcount: 249999
pendingcount: 0
successcount: 249978
errorcount: 21

We can see that processing of the dataset is complete. In just over seven minutes, 250 thousand addresses were geocoded (minus the errors counted in errorcount). All that remains is to save the results:

>>> service.result()
Requesting result data ...
File saved successfully

Full Listing of the Batch Class


I think it doesn't hurt to include the script in full:

import requests
import json
import time
import zipfile
import io
from bs4 import BeautifulSoup

class Batch:

    SERVICE_URL = "https://batch.geocoder.ls.hereapi.com/6.2/jobs"
    jobId = None

    def __init__(self, apikey="your_api_key"):
        self.apikey = apikey
        
            
    def start(self, filename, indelim=";", outdelim=";"):
        
        file = open(filename, 'rb')

        params = {
            "action": "run",
            "apiKey": self.apikey,
            "politicalview":"RUS",
            "gen": 9,
            "maxresults": "1",
            "header": "true",
            "indelim": indelim,
            "outdelim": outdelim,
            "outcols": "displayLatitude,displayLongitude,locationLabel,houseNumber,street,district,city,postalCode,county,state,country",
            "outputcombined": "true",
        }

        response = requests.post(self.SERVICE_URL, params=params, data=file)
        self.__stats (response)
        file.close()
    

    def status (self, jobId = None):

        if jobId is not None:
            self.jobId = jobId
        
        statusUrl = self.SERVICE_URL + "/" + self.jobId
        
        params = {
            "action": "status",
            "apiKey": self.apikey,
        }
        
        response = requests.get(statusUrl, params=params)
        self.__stats (response)
        

    def result (self, jobId = None):

        if jobId is not None:
            self.jobId = jobId
        
        print("Requesting result data ...")
        
        resultUrl = self.SERVICE_URL + "/" + self.jobId + "/result"
        
        params = {
            "apiKey": self.apikey
        }
        
        response = requests.get(resultUrl, params=params, stream=True)
        
        if (response.ok):    
            zipResult = zipfile.ZipFile(io.BytesIO(response.content))
            zipResult.extractall()
            print("File saved successfully")
        
        else:
            print("Error")
            print(response.text)
    

    
    def __stats (self, response):
        if (response.ok):
            parsedXMLResponse = BeautifulSoup(response.text, "lxml")

            self.jobId = parsedXMLResponse.find('requestid').get_text()
            
            for stat in parsedXMLResponse.find('response').findChildren():
                if(len(stat.findChildren()) == 0):
                    print("{name}: {data}".format(name=stat.name, data=stat.get_text()))

        else:
            print(response.text)

Results Analysis


In the end, I moved from slow online applications to batch geocoding services. The choice of a geo-service provider depends entirely on the tasks you face. I regularly receive requests to process a large number of addresses, and the approach described in this article has helped reduce the time significantly.

I hope this article proves useful, and of course I am open to comments and additions!
