Features of designing a data model for NoSQL

Introduction


“You need to run as fast as you can just to stay in place, and to get somewhere you must run at least twice as fast!”
(c) Alice in Wonderland


Some time ago I was asked to give a lecture on data model design to our company’s analysts, because when we stay on the same project for a long time (sometimes several years) we lose sight of what is happening in the wider IT world. As it happens, our company does not use NoSQL databases on many projects (at least for now), so in the lecture I gave them separate attention, using HBase as the example, and tried to aim the material at people who have never worked with them. In particular, I illustrated some features of data model design with an example I had read several years earlier in the article “Introduction to HBase Schema Design” by Amandeep Khurana. While going through the example, I compared several ways of solving the same problem to convey the main ideas to the audience more clearly.


Recently, having nothing better to do (long May weekends in quarantine are especially suited to this), I wondered how well the theoretical estimates would match practice. That is how the idea for this article was born. A developer who has been working with NoSQL for a while will probably not learn anything new from it (and can therefore skip half of the article right away). But for analysts who have not yet worked closely with NoSQL, I think it will be useful for getting a basic understanding of how data models are designed for HBase.


Analyzing an Example


, NoSQL , «» «». . NoSQL . NoSQL , , . , ( ). , «» , . NoSQL . , , .


«» , :


. , ( , Linkedin). :
  • , ( )
  • / / ( )

, . (, , , , : , .., «»), /. :


user_id    friend_id

ID


HBase , :


  • , full table scan,
    • , SQL- – ; , , Impala SQL- Join’ HBase, …

ID . « ID ?» . «» ( 1 (default), ):


RowKey
1:2:3:
1:2:

. : 1, 2, … — , ID . , . (1, 2 3), – (1 2) – HBase, :


  • ( -> , -> )

:


  • : , , , RowKey = «» , «» . , « » False;
  • : : RowKey = «», . - , , ID .
  • : :
    • RowKey = «» , , ;
    • , , «» , «» .

, , « », , -. n. (n-1). (-1) , - .


  • : . (n)
  • : : , => (n)
  • : :
    • – => (n)
    • «» . « », (n-1) . , «-» - – n. ( , (2)) (n) . : «» , , :

, O(n).
, , , , - . «count», . - , «count». , «count» . .. 2 (count):


RowKey
1:2:3:count: 3
1:2:count: 2
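
A minimal sketch of the layout in the table above, assuming the numbered columns hold friend IDs and the `count` column stores how many of them are filled. A plain Python dict stands in for one HBase row; the helper names and sample IDs are hypothetical and are not part of any HBase client.

```python
# A plain dict stands in for one HBase row: column name -> value.
# Assumption: columns "1", "2", ... hold friend IDs, "count" holds their number.

def add_friend(row, friend_id):
    """Append a friend using the counter, without reading the other columns."""
    next_index = row.get("count", 0) + 1
    row[str(next_index)] = friend_id
    row["count"] = next_index


def is_friend(row, friend_id):
    """Checking still means looking through every numbered column."""
    return any(
        value == friend_id
        for column, value in row.items()
        if column != "count"
    )


row = {}
for friend in ("user_b", "user_c", "user_d"):
    add_friend(row, friend)
```

With the counter, adding a friend touches only two cells, while the membership check still has to walk through all the numbered columns.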

:


  • : « ?» => (n)
  • : : , , «count» .. . (1)
  • : : , - «» . , , , => O(n)
  • , «count», , -

2 , « ». «» 3 (col).
« »: ! – , 1 (, , , «// ..»). «», NoSQL-, HBase :


RowKey
: 1: 1: 1
: 1: 1

. , :


  • : , , , «»: , True, – False => O(1)
  • : : : «ID » => O(1)
  • : : «ID » => O(1)

, , , , . , …


- . ? userID.friendID? ( 4(row)):


RowKey
.: 1
.: 1
.: 1
.: 1
.: 1

, , (1). 3 - .


«». , 4 , , , ( , HBase ). , . , userID friendID, , , . ( 5(hash)):


RowKey
dc084ef00e94aef49be885f9b01f51c01918fa783851db0dc1f72f83d33a5994: 1
dc084ef00e94aef49be885f9b01f51c0f06b7714b5ba522c3cf51328b66fe28a: 1
dc084ef00e94aef49be885f9b01f51c00d2c2e5d69df6b238754f650d56c896a: 1
1918fa783851db0dc1f72f83d33a59949ee3309645bd2c0775899fca14f311e1: 1
1918fa783851db0dc1f72f83d33a5994dc084ef00e94aef49be885f9b01f51c0: 1
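
The keys in the table above are 64 hexadecimal characters long, and rows that belong to the same user share their first 32 characters, which is consistent with concatenating two MD5 digests. Below is a minimal sketch under that assumption, using `hashlib`; the sample IDs and the helper names `md5_hex` and `make_row_key` are hypothetical.

```python
import hashlib


def md5_hex(value: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def make_row_key(user_id: str, friend_id: str) -> str:
    """Build a fixed-length row key: MD5(user_id) followed by MD5(friend_id)."""
    return md5_hex(user_id) + md5_hex(friend_id)


key_ab = make_row_key("user_a", "user_b")
key_ac = make_row_key("user_a", "user_c")
key_ba = make_row_key("user_b", "user_a")
```

Every key has the same length regardless of how long the original IDs are, and all rows of one user start with the same 32-character prefix, so they can still be selected with a prefix scan.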

, , , 4 – (1).
, :


1 (default)    O(n)    O(n)    O(n)
2 (count)      O(1)    O(n)    O(n)
3 (column)     O(1)    O(1)    O(1)
4 (row)        O(1)    O(1)    O(1)
5 (hash)       O(1)    O(1)    O(1)

, 3-5 . , , , «», « ». 3. , , .



– . « » , (n). , , , « », «-». «-» :


  • ,

, , :


  • . n. " " – . , « » HBase . – «-»
  • . «», , . = - , «», – «». , «» «» ( 1 2). .
  • . . – ( «» , ). .

5 , , . n , , 5 .
n= 5. «» ID-:



{0: [1], 1: [4, 5, 3, 2, 1], 2: [1, 2], 3: [2, 4, 1, 5, 3], 4: [2, 1]} #  15 

{0: [1, 10800], 1: [5, 10800, 2, 10801, 4, 10802], 2: [1, 10800], 3: [3, 10800, 1, 10801, 5, 10802], 4: [2, 10800]} #  18  

{0: [1], 1: [1, 3, 2, 5, 4], 2: [1, 2], 3: [4, 1, 2, 3, 5], 4: [1, 2]} #  15 

, ID, 10 000 – , False. , «» .


Windows 10, - HBase, – Python Jupyter Notebook. 2 CPU 2 . , « », «» Python. HBase happybase, (MD5) 5 — hashlib


n = 10, 30, …. 170 – ( n) - ( 15 ).


, . . n, « » , «» , ( ).



– , . – .

3-5 «-», .
2 , , 2 3-5. , – - / HBase 2 . , .
1 , .
.

3-5 – , . 1 2 . 2 – - «count», n . - , . , ( , 1 2, ) ( " ").


– .



. 3-5 .
, , 4 5, , , 3. , – , , .


1 2, , . 2 1 – - - «» count.


:


  • 3-5 , HBase; .
  • 4 5 . , 5 . , .
  • , «-» , .


. , ( ). , thrift, happybase, , Python ( , ), HBase, Windows 10 .. , . « » .


In conclusion, some recommendations for anyone who is just starting to design data models in HBase: set aside your previous experience with relational databases and remember the “commandments”:


  • Design from the task and the data access patterns, not from the domain model
  • Efficient access (without a full table scan) is possible only by key
  • Denormalization
  • Different rows may contain different columns
  • Columns can be composed dynamically
