Get an extract from Rosreestr through FSIS USRN and Python. Part 2

In this article we will request extracts from FSIS USRN (the Federal State Information System of the Unified State Register of Real Estate) for several real estate objects at once, using Python and Selenium. The captcha is solved through the anticaptcha service and its API. We do not use a neural network for the captcha: it looks harder to implement, and its share of correctly recognized captchas is still lower.

Link to the first part of the article: Get an extract from Rosreestr through FSIS USRN and Python. Part 1




The program starts the same way as the one in the previous post. First it logs in to the FSIS USRN service automatically by typing in the access key:
the code
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.action_chains import ActionChains
import openpyxl
import pyautogui
import os
from python3_anticaptcha import ImageToTextTask

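wb = openpyxl.load_workbook('rosreestr-objects.xlsx')
sheet = wb.active  # replaces the deprecated get_active_sheet()

browser = webdriver.Firefox()
browser.implicitly_wait(40)  # wait up to 40 seconds for elements to appear
# The find_element_by_* helpers used below need a Selenium version older than 4.3.
browser.get('https://rosreestr.ru/wps/portal/p/cc_present/ir_egrn')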

act = browser.find_element_by_css_selector('.v-panel-content > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > input:nth-child(1)')
for i in '---------':
	act.send_keys(i)
time.sleep(2)
act = browser.find_element_by_css_selector('.v-panel-content > div:nth-child(1) > div:nth-child(1) > div:nth-child(3) > div:nth-child(1) > input:nth-child(1)')
for i in '----':
	act.send_keys(i)
time.sleep(2)
act = browser.find_element_by_css_selector('.v-panel-content > div:nth-child(1) > div:nth-child(1) > div:nth-child(5) > div:nth-child(1) > input:nth-child(1)')	
for i in '----':
	act.send_keys(i)
time.sleep(2)
act = browser.find_element_by_css_selector('.v-panel-content > div:nth-child(1) > div:nth-child(1) > div:nth-child(7) > div:nth-child(1) > input:nth-child(1)')                                         
for i in '----':
	act.send_keys(i)
time.sleep(2)
act = browser.find_element_by_css_selector('.v-panel-content > div:nth-child(1) > div:nth-child(1) > div:nth-child(9) > div:nth-child(1) > input:nth-child(1)')                                           
for i in '-----------':
	act.send_keys(i)
time.sleep(2)
act = browser.find_element_by_css_selector('.v-button-normalButton > span:nth-child(1) > span:nth-child(1)')
act.click()


Replace each “---” placeholder with the corresponding part of your FSIS USRN access key. The parts are the groups separated by the “-” symbol in the key.
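The five nearly identical login steps can also be driven from the key itself. Below is a minimal sketch, assuming the key has five parts separated by “-” and that the inputs sit at the nth-child positions 1, 3, 5, 7 and 9 used in the code above. The names FIELD_SELECTOR and key_fields are hypothetical helpers, not part of the original program.

```python
# Selector pattern copied from the login code above; {n} is the position of the input.
FIELD_SELECTOR = (
    '.v-panel-content > div:nth-child(1) > div:nth-child(1) > '
    'div:nth-child({n}) > div:nth-child(1) > input:nth-child(1)'
)


def key_fields(access_key):
    """Pair each dash-separated part of the key with the selector of its input."""
    parts = access_key.split('-')
    if len(parts) != 5:
        raise ValueError('expected 5 parts separated by "-", got %d' % len(parts))
    # The five inputs sit at nth-child positions 1, 3, 5, 7 and 9.
    return [(FIELD_SELECTOR.format(n=2 * i + 1), part)
            for i, part in enumerate(parts)]


# Usage with the browser object from the script (not executed here):
# for selector, part in key_fields(your_access_key):
#     browser.find_element_by_css_selector(selector).send_keys(part)
#     time.sleep(2)
```

This keeps the access key in one place instead of five string literals.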
"rosreestr-objects.xlsx" is the file with the real estate objects to request. If there are more than 20 objects, problems may arise; they are described below.

Now we build a list of objects from the Excel table, type it into the search field on the site and click search:
the code
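test = []  # search strings, one per object
n = 1
while n < 11:  # read rows 1-10 of column C in the Excel sheet
        i = sheet['C' + str(n)].value
        test.append(i + ';')
        n += 1
# open the object search form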
act = browser.find_element_by_css_selector('.v-gridlayout-margin > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > span:nth-child(1) > span:nth-child(2)')
act.click()        
time.sleep(1)
act = browser.find_element_by_css_selector('.v-verticallayout-searchFormOuter > div:nth-child(1) > div:nth-child(2) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > input:nth-child(1)')
act.click()        
act.clear()        
for i in test:
        act.send_keys(i)        
time.sleep(3)
act = browser.find_element_by_css_selector('.v-filterselect-error > input:nth-child(1)')
act.click()        
act.clear()        
for i in ' ':
        act.send_keys(i)
time.sleep(5)
act.send_keys(Keys.ENTER)
act = browser.find_element_by_css_selector('.v-horizontallayout-borderTop > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > span:nth-child(1) > span:nth-child(1)')
act.click()


This way several objects are sent to the site in a single search, which saves additional time.
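The loop above fails if a cell in column C is empty, because None cannot be joined with ';'. A minimal sketch of a more tolerant way to build the search string follows. The function name build_query and the sample cadastral numbers are made up for illustration.

```python
def build_query(values):
    """Join cadastral numbers into one ';'-terminated search string, skipping blanks."""
    cleaned = []
    for value in values:
        if value is None:
            continue  # empty Excel cell
        text = str(value).strip()
        if text:
            cleaned.append(text + ';')
    return ''.join(cleaned)


# Sample values are invented; in the script they would come from sheet['C' + str(n)].value
print(build_query(['77:01:0001001:1', None, ' 77:01:0001001:2 ']))
```

The resulting string can be typed into the search field with a single send_keys call.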

Register on anti-captcha.com


You do not have to use this particular site, but you can take it as a starting point.
As the name suggests, the resource solves captchas, for a fee: 1 dollar per 1000 captchas, which is enough to start with. The principle is simple: a captcha sent to the site (or to its API) is solved by living people (or not quite living ones) on the other side of the screen. The answer arrives almost instantly, fast enough to compete with neural networks, and the accuracy is close to one hundred percent.
In our case the algorithm is: take a screenshot of the captcha, send the image to the service through its API, and read the answer. This works for almost any captcha made of digits, letters and so on.

So, after registering on the site and paying 1 dollar, pick up your key in the API section:
picture



picture2


That's all; we don't need the site itself anymore.

We return to the program.


The found objects are shown on the screen as a list, so the program opens each object in turn and submits a request for it, solving the captcha:
the code
x = 1
while x < 11:  # assumed bound: the 10 objects read from the Excel sheet
        # click row x of the results table
        ActionChains(browser).move_to_element(browser.find_element_by_xpath('/html/body/div[1]/div[6]/div[4]/div/div/section/div[2]/div[2]/div/div/div[2]/div/div[2]/div/div/div/div[1]/div/div/div/div[5]/div/div/div[2]/div[1]/table/tbody/tr['+str(x)+']')).click().perform()
        time.sleep(2)
        act = browser.find_element_by_css_selector('.v-textfield')
        act.click()
        time.sleep(1)
        act = browser.find_element_by_tag_name('html')
        act.send_keys(Keys.PAGE_DOWN)  # scroll down so the captcha is on screen
        time.sleep(2)
        a = 0
        os.chdir('C:\\1')
        # capture the captcha area; this region is for a 1280x1024 screen
        im = pyautogui.screenshot(imageFilename=str(a)+'.jpg', region=(238,394,220,70))
        #im=pyautogui.screenshot(imageFilename=str(a)+'.jpg',region=(317,404,160,200))  # alternative region (other screen resolution)
        time.sleep(1)
        captcha_file = 'C:\\1\\0.jpg'


The captcha for an object is not visible on the screen right away, so the program presses Page Down, then takes a screenshot of the captcha and saves it to disk. Screen resolutions differ; the program was written for a 1280x1024 screen.
To spare you the trouble of picking screen coordinates for the captcha region, here is code that shows the current mouse position on the screen:
the code
#! python3
# mouseNow.py - Displays the mouse cursor's current position.
import pyautogui
print('Press Ctrl-C to quit.')
try:
    while True:
        # Get and print the mouse coordinates.
        x, y = pyautogui.position()
        positionStr = 'X: ' + str(x).rjust(4) + ' Y: ' + str(y).rjust(4)
        pixelColor = pyautogui.screenshot().getpixel((x, y))
        positionStr += ' RGB: (' + str(pixelColor[0]).rjust(3)
        positionStr += ', ' + str(pixelColor[1]).rjust(3)
        positionStr += ', ' + str(pixelColor[2]).rjust(3) + ')'
        print(positionStr, end='')
        print('\b' * len(positionStr), end='', flush=True)

except KeyboardInterrupt:
    print('\nDone.')
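If the captcha is a separate element on the page, Selenium can also save a picture of just that element, which removes the dependence on screen coordinates and resolution. This is a minimal sketch, not the approach used in the program. The helper name save_captcha and the selector in the usage comment are placeholders, and whether the captcha on the site can be located this way is an assumption.

```python
def save_captcha(captcha_element, path):
    """Save a screenshot of a single page element and return the path on success."""
    # WebElement.screenshot(filename) writes a PNG of just that element
    # and returns False if the file could not be written.
    if not captcha_element.screenshot(path):
        raise IOError('could not save captcha image to ' + path)
    return path


# Usage with the browser object from the script (selector is a placeholder):
# element = browser.find_element_by_css_selector('img.captcha-placeholder')
# captcha_file = save_captcha(element, 'C:\\1\\0.png')
```

The saved file is a PNG rather than a JPG, so the file name passed on to the recognition service has to change accordingly.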



Now we send the image to the anticaptcha API for recognition, and the program types the result into the corresponding field on the Rosreestr site:

the code
        ANTICAPTCHA_KEY = "-------------------------------"
        result = ImageToTextTask.ImageToTextTask(anticaptcha_key=ANTICAPTCHA_KEY).captcha_handler(captcha_file=captcha_file)
        b = result.get('solution').get('text')  # recognized captcha text
        print(b)
        act = browser.find_element_by_css_selector('.v-textfield')
        act.click()
        for ch in b:  # type the answer one character at a time
                act.send_keys(ch)
                time.sleep(0.1)
        act.send_keys(Keys.ENTER)
        time.sleep(1)


* Do not forget to enter your API key instead of "-------------------------------".
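The chained result.get('solution').get('text') raises an AttributeError when the response has no 'solution' entry, for example when the service could not process the image. A minimal sketch of a safer lookup is shown below. The name captcha_text is a hypothetical helper, and the only response shape assumed is the 'solution' and 'text' pair used in the code above.

```python
def captcha_text(result):
    """Return the recognized text, or None if the response carries no solution."""
    solution = (result or {}).get('solution') or {}
    text = solution.get('text')
    return text if text else None


print(captcha_text({'solution': {'text': 'ab12'}}))
print(captcha_text({}))
```

A None result can then be treated as a signal to take a new screenshot and try again.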

All that remains is to click the appropriate buttons and move on to the next object in the loop:
the code
        act.click()
        time.sleep(3)
        act = browser.find_element_by_css_selector('.v-table-body-wrapper')
        act.send_keys(Keys.DOWN)  # scroll the results table down
        act.send_keys(Keys.DOWN)
        time.sleep(3)
        x += 1  # next object


Difficulties may arise here if there are too many objects (50 or more). The viewport shifts, and some objects end up outside the part of the window the program can see. How to deal with this? Perhaps by adding one more act.send_keys(Keys.DOWN) to the code above.
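Another possible way around long result lists, offered as an untested assumption rather than a proven fix, is to search for the objects in small batches, for example 10 at a time as in the Excel loop above, and to repeat the search step for each batch. The helper name batches is hypothetical.

```python
def batches(items, size=10):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 25 made-up object identifiers split into groups of 10, 10 and 5
for group in batches(list(range(25)), size=10):
    print(len(group))
```

Each group would then go through the search and request steps on its own, so the results table never grows beyond the batch size.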
What if even the people on the other side of the screen solve the captcha incorrectly (by the way, sometimes the captcha does not load at all, and even refreshing the image does not help)? Add error handling to the code. But that is a completely different story.
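As a starting point for such error handling, here is a minimal retry sketch. It is not part of the original program: solve_with_retries is a hypothetical helper, and the two callables it takes stand in for "screenshot the captcha and ask the service" and "type the answer and check whether the site accepted it". How to detect acceptance on the site is left open.

```python
def solve_with_retries(solve, submit, max_attempts=3):
    """Call solve() and then submit(text) until submit reports success.

    solve  -- callable returning the recognized text, or None on failure
    submit -- callable taking the text and returning True if it was accepted
    Returns the accepted text, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            text = solve()
        except Exception as error:  # network or service error: try again
            print('attempt %d: solver failed: %s' % (attempt, error))
            continue
        if text and submit(text):
            return text
        print('attempt %d: captcha not accepted' % attempt)
    return None
```

If the function returns None, the program can skip the object and record it for a later run instead of stopping.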

Don't like anticaptcha? Use Rucaptcha!



To switch to a similar service which, in my subjective experience, returns recognized captchas faster and costs a little less (33 rubles per 1000 captchas), two changes are enough.
First, get an API key by registering at rucaptcha.com.
Second, use it in the program by changing the corresponding lines:

The code
from python_rucaptcha import ImageCaptcha

RUCAPTCHA_KEY = "your api key"
result = ImageCaptcha.ImageCaptcha(rucaptcha_key=RUCAPTCHA_KEY).captcha_handler(captcha_file=captcha_file)
b = result["captchaSolve"]  # recognized captcha text



As you can see, the services are similar. Which one to choose is a matter of taste.
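Since the two responses differ only in where the text is stored, the rest of the program can stay the same for both services if the lookup is wrapped in one function. A minimal sketch, assuming only the two response shapes shown in this article; recognized_text is a hypothetical helper name.

```python
def recognized_text(result):
    """Return the captcha text from either response shape used in this article."""
    if 'captchaSolve' in result:       # python_rucaptcha response
        return result['captchaSolve']
    return result['solution']['text']  # python3_anticaptcha response


print(recognized_text({'captchaSolve': 'ab12'}))
print(recognized_text({'solution': {'text': 'cd34'}}))
```

With this in place, switching services means changing only the line that creates result.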

Program - download .
Test real estate objects - download .
