📯 👨‍🚒 💇🏾 How to forecast airfare? 🤜🏼 🥂 🍄

Hello everyone!

This is the third article about how I make a small and comfortable service, which ~~in theory~~ should help with travel planning. In this article I’ll talk about how to predict airfare with Clickhouse, Catboost and 1TB * data at hand.

What is it for?

One of the main features of cheapster.travel is the flexible combination of complex routes (more in the previous article ). In order to combine “all-with-all”, an aggregator cache is used, in which there are not always tickets that are rarely looked for, and they are sorely lacking to build complex routes. Those. hot tickets (cheap) on which to base complex route there , but not enough 1-2 segments of the "normal" tickets (at the regular price, not on the most popular direction). It was this problem that led me to the need to build a model that could predict airfare.

Task formalization

You need to be able to predict tickets for direct flights (only round-trip)
You need to be able to regularly predict and store this in the database (simple scenario)
Need to be able to predict "on the fly" (complex scenario)
This all happens on a very limited hardware - therefore, a minimum of manipulation with large amounts of data

How to do it?

To begin with, we will train the model: prepare the dataset, select the maximum number of features in the columns, upload it to tsv, load it into the DataFrame / Pool, ~~analyze, select the parameters ...~~ Stop, we have too much data and they do not fit into memory , - catch the following errors:

MemoryError: Unable to allocate array with shape (38, 288224989) and data type float64
OSError: [Errno 12] Cannot allocate memory

To get around this limitation, it was necessary to learn iteratively in small pieces, it looks like this:

model = CatBoostRegressor(cat_features=cat_features,
          iterations=100,
          learning_rate=.5,
          depth=10,
          l2_leaf_reg=9,
          one_hot_max_size=5000)

for df in tqdm(pd.read_csv('history.tsv', sep='\t', 
                           na_values=['\\N'], 
                           chunksize=2_000_000)):
    ...
     model.fit(X=df[df.columns[:-1]][:train_size].values,
                  y=df['price'][:train_size].values,
                  eval_set=eval_pool,
                  verbose=False,
                  plot=False,
                  init_model=model) # <--

The result was a model with RMSE ~ 100 - in general, I would have been happy with such a result, but after a little analysis and "normalization" of predictions (negative and values that differ greatly from min / max values in history are brought to the corresponding boundaries of historical prices) . After that, the target metric is ~ 80, taking into account the fact that, in my experience, there is almost no logic and common sense in pricing airline tickets.

Features that affect the price most:

Statistics for features “Distance between cities”:

Great, we have a model - now it's time to use it. First of all, add the KX model, this is done with a simple config:

Config

<models>
    <model>
        <!-- Model type. Now catboost only. -->
        <type>catboost</type>
        <!-- Model name. -->
        <name>price</name>
        <!-- Path to trained model. -->
        <path>/opt/models/price_iter_model_2.bin</path>
        <!-- Update interval. -->
        <lifetime>0</lifetime>
    </model>
</models>

We make a regular prediction process - it’s easy enough to do it using Apache Airflow.

The resulting DAG looks like this

DAGa ( Airflow):

SimpleHttpOperator

insert_ow_in_tmp = SimpleHttpOperator(
    task_id='insert_ow_in_tmp',
    http_conn_id='clickhouse_http',
    endpoint=dll_endpoint,
    method='POST',
    data=sql_templates.INSERT_OW_PREDICTIONS_IN_TMP,
    pool='clickhouse_select',
    dag=dag
)

For prediction "on the fly" using ordinary sql:

select origin, destination, date,
         modelEvaluate('price', *)  predicted_price -- ,   
from log.history

+--------+-------------+------------+-----------------+
| origin | destination | date       | predicted_price |
+--------+-------------+------------+-----------------+
| VKO    | DEB         | 2020-03-20 | 3234.43244      |
+--------+-------------+------------+-----------------+
--* ,

I want to replace the fact that such an approach has been chosen, not only because it is easier to implement - there are still pluses:

There is no need to upload data to the outside of the KH (this means faster and less costly on the load on the iron)
No need to do etl processes (easier = more reliable)

We slightly correct the API and the front-end and get the long-awaited predictions.

These predictions also fit well into the section Airline ticket price history: The

functionality is available at cheapster.travel/history (it will open crookedly on the mobile, only large screens).

That's all, all a productive day!

An attempt to solve the problem of choosing air tickets before vacation
An attempt to solve the problem of choosing air tickets before vacation # 2

Another interesting feature

Combinator of difficult routes
Complex tickets (triangles)

PS
Important! Do not take these predictions as something that helps you choose a purchase date - the model may not correctly predict, moreover, its adequacy has not been verified by me or anyone else (all at your own peril and risk, without guarantees).

1TB * - this is if you upload to tsv, in KX it takes an order of magnitude less.

UPD:

Top Unobvious Issues When Using Catboost Bundles - Clickhouse

Categorical features in KH change the order and become at the end (and not in the order that was during the training);
modelEvaluate returns null - you need to check if you have null values in features, if you need to replace them with nan
In new versions there is an unobvious moment with the config format for KX, described here

How to forecast airfare?

What is it for?

Task formalization

How to do it?

Previous Articles

Another interesting feature

Top Unobvious Issues When Using Catboost Bundles - Clickhouse

More articles: