Importing the USRLE with Apache NiFi. Step 1: Downloading Files over HTTPS

In one of our projects, we needed to move the import of data from third-party systems to a microservice architecture. Apache NiFi was chosen as the tool. The import of the Federal Tax Service register of legal entities (USRLE) was chosen as the first experiment.


USRLE data is published as XML files packaged in ZIP archives. The archives are published daily at https://ftp.egrul.nalog.ru/ in a separate directory for each date. Access requires a PKCS12 key.


The task for NiFi is to download the files from the Federal Tax Service and prepare the downloaded data for import into our services. This article describes how to implement the file download.


Data source


USRLE data is provided through the Federal Tax Service offering "Integration and access to the databases of the USRLE and the USRIP." A description of the interaction model is presented here.


The files are downloaded from the following Federal Tax Service resource:
https://ftp.egrul.nalog.ru/?dir=EGRUL


Directories with the FULL suffix contain a dump of the full register as of the beginning of the corresponding year. The remaining directories contain daily updates to the register. We are interested in downloading the daily updates.


Flow setup in Apache NiFi


The flow's job is to list the files in the directory with yesterday's upload, download those files, and unzip them.


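The flow in NiFi consists of the following steps:

  1. creating a FlowFile with the directory URL,
  2. retrieving the HTML of the directory,
  3. extracting the file links from the HTML,
  4. downloading the files,
  5. unpacking the archives.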


Creating the FlowFile


The initial FlowFile is created by the GenerateFlowFile processor.


GenerateFlowFile processor screenshots
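

Once every 24 hours, the processor generates a FlowFile with the attribute fnsEgrulURL. Its value is the link to the directory with yesterday's upload, for example https://ftp.egrul.nalog.ru/?dir=EGRUL/14.04.2020. The link is built with NiFi Expression Language: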

${literal('https://ftp.egrul.nalog.ru/?dir=EGRUL/'):append(${now():toNumber():minus(86400000):format('dd.MM.yyyy')})}

That is, the current date is taken and converted to its numeric representation in milliseconds. Then 86,400,000 milliseconds (one day) are subtracted from it. The result is converted to a string in the dd.MM.yyyy format and appended to the constant part of the link.
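For readers who want to check the date arithmetic outside NiFi, here is a minimal Python sketch of the same calculation. The function name yesterday_directory_url is made up for this illustration, and passing the current time in as a parameter is an assumption that keeps the example deterministic.

```python
from datetime import datetime, timedelta

# Constant part of the link, taken from the expression above.
BASE_URL = "https://ftp.egrul.nalog.ru/?dir=EGRUL/"


def yesterday_directory_url(now: datetime) -> str:
    """Return the URL of the directory with the upload for the day before `now`."""
    # One day equals 86,400,000 milliseconds, matching minus(86400000).
    yesterday = now - timedelta(days=1)
    # %d.%m.%Y is the strftime counterpart of the dd.MM.yyyy pattern.
    return BASE_URL + yesterday.strftime("%d.%m.%Y")


if __name__ == "__main__":
    print(yesterday_directory_url(datetime(2020, 4, 15, 3, 0)))
    # https://ftp.egrul.nalog.ru/?dir=EGRUL/14.04.2020
```

Like the NiFi expression, the sketch subtracts a fixed 24 hours rather than a calendar day, so it relies on the flow not running close to a clock change.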

At the output, we get a FlowFile of the following form:


FlowFile screenshots




Retrieving Directory Content


The InvokeHTTP processor is used to obtain the contents of the directory. It performs a GET request using the link to the directory with yesterday's upload. In response, the processor receives the HTML code of the directory and writes it to the FlowFile as content.


InvokeHTTP processor screenshots




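Processor settings:
HTTP Method: GET;
Remote URL: the URL to request. ${fnsEgrulURL} is the value of the FlowFile attribute fnsEgrulURL;
SSL Context Service: an SSLContextService is required because the request goes over HTTPS. The PKCS12 key is configured in it.


At the output, we get a FlowFile whose content is the HTML code of the directory.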


SSLContextService


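SSLContextService screenshots




In the SSLContextService, you specify the PKCS12 key file, its password, and its type.


The truststore is the cacerts file from the JDK. The Federal Tax Service certificate must be added to it. To obtain the certificate, open https://fns.egrul.nalog.ru in a browser using the PKCS12 key and view the certificate chain of the site.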



In the certificate chain, select the certificate of the Russian Federal Tax Service DPC and export it in .CER format with DER encoding. Then import the certificate from the resulting file into the cacerts store using the keytool utility. For example:

C:\Program Files\Java\jdk1.8.0_121\bin> keytool -importcert -keystore "C:\Program Files\Java\jdk1.8.0_121\jre\lib\security\cacerts" -file {path to the .CER file}

The default password for cacerts is changeit.

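The cacerts file must be placed where NiFi can read it, for example on a Persistent Volume, and its path must be specified in the SSLContextService. The keystore type is PKCS12, and the type of cacerts is JKS.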



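The GetHTMLElement processor parses the HTML code in the FlowFile content and extracts the links to the ZIP archives. The HTML of a directory looks like this: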


<div id="page-content" class="container">
    <div id="directory-list-header">
        <div class="row">
            <div class="col-md-7 col-sm-6 col-xs-10"></div>
            <div class="col-md-2 col-sm-2 col-xs-2 text-right"></div>
            <div class="col-md-3 col-sm-4 hidden-xs text-right"> </div>
        </div>
    </div>
    <ul id="directory-listing" class="nav nav-pills nav-stacked">
                            <li data-name=".." data-href="https://ftp.egrul.nalog.ru/?dir=EGRUL">
                <a href="https://ftp.egrul.nalog.ru/?dir=EGRUL" class="clearfix" data-name="..">
                    <div class="row">
                        <span class="file-name col-md-7 col-sm-6 col-xs-9">
                            <i class="fa fa-level-up fa-fw"></i>
                            ..                                </span>
                        <span class="file-size col-md-2 col-sm-2 col-xs-3 text-right">
                            -                                </span>
                        <span class="file-modified col-md-3 col-sm-4 hidden-xs text-right">
                            2020-04-05 22:00:00                                </span>
                    </div>
                </a>
            </li>
                            <li data-name="EGRUL_2020-04-05_1.zip" data-href="EGRUL/05.04.2020/EGRUL_2020-04-05_1.zip">
                <a href="EGRUL/05.04.2020/EGRUL_2020-04-05_1.zip" class="clearfix" data-name="EGRUL_2020-04-05_1.zip">
                    <div class="row">
                        <span class="file-name col-md-7 col-sm-6 col-xs-9">
                            <i class="fa fa-file-archive-o fa-fw"></i>
                            EGRUL_2020-04-05_1.zip                                </span>
                        <span class="file-size col-md-2 col-sm-2 col-xs-3 text-right">
                            528.78KB                                </span>
                        <span class="file-modified col-md-3 col-sm-4 hidden-xs text-right">
                            2020-04-05 22:00:24                                </span>
                    </div>
                </a>                     
                    <a href="javascript:void(0)" class="file-info-button">
                        <i class="fa fa-info-circle"></i>
                    </a>
            </li>
    </ul>
</div>


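GetHTMLElement processor screenshots


Processor settings:
URL: the base URL of the HTML document, used to resolve relative links;
CSS Selector: the selector for the elements to extract. li[data-name^=EGRUL] selects li elements whose data-name attribute starts with EGRUL;
Output Type: Attribute, that is, the value of an attribute of the HTML element is returned;
Destination: flowfile-attribute, that is, the result is written to a FlowFile attribute (HTMLElement);
Attribute Name: the name of the attribute whose value is returned. abs:${literal('data-href')} gives the absolute URL (abs:) built from the data-href attribute of the element matched by the CSS selector.


As a result, for each element matched by the CSS selector, a FlowFile is created with the link to a ZIP archive in its HTMLElement attribute.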
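To make the selection logic concrete, here is a rough Python sketch that does the equivalent of li[data-name^=EGRUL] combined with abs:data-href, using only the standard library. It is an illustration of the idea, not the way NiFi implements it; the names ZipLinkParser and extract_zip_links are made up for this example, and resolving relative links against the directory URL is an assumption about how the base URL is configured.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ZipLinkParser(HTMLParser):
    """Collects data-href values of <li> elements whose data-name starts with a prefix."""

    def __init__(self, base_url: str, prefix: str = "EGRUL"):
        super().__init__()
        self.base_url = base_url
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        attributes = dict(attrs)
        name = attributes.get("data-name") or ""
        href = attributes.get("data-href")
        # Equivalent of the ^= operator: the attribute value starts with the prefix.
        if name.startswith(self.prefix) and href:
            # Equivalent of the abs: prefix: resolve the link against the base URL.
            self.links.append(urljoin(self.base_url, href))


def extract_zip_links(html: str, base_url: str) -> list:
    """Return absolute links to the archives listed in the directory HTML."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Applied to the directory HTML shown earlier, the sketch skips the ".." entry, because its data-name does not start with EGRUL, and returns one absolute link for the EGRUL_2020-04-05_1.zip entry.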



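The files are downloaded with an InvokeHTTP processor, in the same way as the HTML code of the directory. The URL is taken from the HTMLElement attribute, which holds the link to the ZIP archive. The SSLContextService is the same one.


InvokeHTTP processor screenshots


At the output, the ZIP archive becomes the content of the FlowFile.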



The archives are unpacked with the UnpackContent processor. The packaging format is ZIP.


UnpackContent processor screenshots


At the output, the processor creates a FlowFile for each XML file unpacked from the ZIP archive.
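The effect of this step can be sketched in Python with the standard zipfile module. The function name unpack_files is hypothetical, and the returned dictionary of file names to bytes merely stands in for the FlowFiles that NiFi would emit, one per archive entry.

```python
import io
import zipfile


def unpack_files(zip_bytes: bytes) -> dict:
    """Return a mapping of file names to file contents for every file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()  # directory entries carry no content
        }
```

Each key-value pair corresponds to one unpacked XML file that the later steps of the flow process independently.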


What's next


Next, each XML file needs to be converted to JSON and split by organization, because each XML file contains from 1 to 1000 register extracts. From JSON, the data can later be loaded into SQL or NoSQL storage.


Converting XML to JSON and the Avro schema are covered in the next article.

