This article is an overview of IBM Data Science Professional Certificate certification .
As a newbie in Python, I had to deal with the implementation of tasks:
- Loading and parsing HTML tables
- Cleaning uploaded data
- Search for geographical coordinates at the address of the object
- Download and process GEOJSON
- Building interactive heat maps (heat map)
- Building interactive background cartograms (choropleth map)
- Transformation of geographic coordinates between a spherical WGS84 and a Cartesian coordinate system UTM
- Representation of spatial geographical objects in the form of a hexagonal grid of circles
- Search for geographic features located at a specific distance from a point
- Georeferencing geographic features to complex surface polygons
- Descriptive Statistical Analysis
- Analysis of categorical variables and visualization of results
- Correlation analysis and visualization of results
- Segmentation using k-Mean clustering and the elbow method
- Analysis and visualization of clusters
During the certification, 9 courses were required:
3 .
«Python for Data Science and AI» «Databases and SQL for Data Science» , .
«Data Analysis with Python», «Data Visualization with Python», «Machine Learning with Python» . Python.
Applied Data Science Capstone – , Python, .
Applied Data Science Capstone.
— Github.
— 700 .
.
Python
1. Introduction
1.1 Background
— 12 . 2561.5 ², 4924.96 /² 1.
12 (125 , 2 , 19 ) 30429 /² "", 560 /² "" 2.
68,768 /² "" 438,568 /² "" 3.
1.2 Business Problem
, , , . () .
, , , .
, , , .
:
2. Data acquisition and cleaning
2.2. Data requirements
, :
, :
GEOJSON
( , , ...),
2.3. Describe data sources
HTML .
Nominatim. , , , .
GEOJSON
( , ..) Foursquare API. , . 100 1000 . :
- , 100 .
- , Foursquare API .
- , Foursquare API ( )
- .
2.4. Describe data cleansing
HTML , . :
- ' \n\t', ' ', ''
- "" ""
- float int
146 :
- Borough_Name — —
- District_Name — ,
- Borough_Type —
- OKATO_Borough_Code — OKATO
- OKTMO_District_Code — OKTMO
- Borough_Area —
- Borough_Population —
- Borough_Population_Density —
- Borough_Housing_Area — .
- Borough_Housing_Area_Per_Person —
- Borough_Housing_Price —
.

Nominatim, , ( 10 ) .
. , Foursquare API.

, ( , , ...), , . 28 000 .
. Foursquare API.

Forsquare API, 34460 ( , , ...) 7899 .
(350 ) , (300 ), .
27622 28 000 .
, , . GEOJSON .
, , .
, .
20864 ( , , ...), 120 . .
.

" " " ".

3. Methodology
:
:
, . .
3.1. Exploratory Data Analysis
:
- District_Name — ,
- Borough_Area —
- Borough_Population —
- Borough_Population_Density —
- Borough_Housing_Area — .
- Borough_Housing_Price —
.

, 12 194 253 943 .
109 421 /² 438 568 /².
District_Name Borough_Population. , District_Name , , «», «», «» «».

District_Name Borough_Housing_Price. , , District_Name .

.
.
. (p-value):
- p-value < 0.001: ,
- p-value is < 0.05: ,
- p-value is < 0.1: ,
- p-value is > 0.1: ,


- Borough_Area, Borough_Population_Density Borough_Population ,
- Borough_Housing_Area Borough_Population ,
- Borough_Area, Borough_Population_Density, Borough_Housing_Area Borough_Housing_Price ,
- Borough_Area Borough_Population_Density ,
3.2. Clustering
, , .
, K-Means Elbow , , .
Elbow , ( 1 10). (distortion) (inertia).
Distortion Inertia .
(elbows) 3 5 .
3- .


, K-Means , :

:
- "0" — (78538 ), (173695 /²) (10328 /²)
- "1" — (153187 ), (160741 /²) (13312 /²)
- "2" — (79805 ), (333794 /²) (10533 /²)
"1" :
boxplot .


(choropleth map)
, "1" .
, .

4. Result
:
- :
- ,
- (heat map) (choropleth map)
.
.

20864 ( , , ...), 928 "Gym / Fitness Center", 259 .
.

(heat map) (choropleth map) (HTML zip ) Interactive map
.



5. Discussion
:
, :
- GEOJSON
- ( , , ...),
, GitHub .csv .
K-Means Elbow . Distortion Inertia , 3- . 5 , .
.
To determine the proximity of competing fitness facilities, an approach was used consisting in displaying fitness centers on an interactive thermal geographical map. This approach allowed us to visually analyze the proximity of existing fitness centers in each area and identify areas with low density.
Based on the data obtained, an additional analysis can be carried out using categorical segmentation of fitness objects and automatic calculation of recommended locations for new fitness centers, taking into account the density of competing fitness objects.