هل يموت RuTracker؟ نقوم بتحليل التوزيعات

أي نشاط يولد البيانات. أياً كان ما تفعله ، فمن المحتمل أن يكون لديك بين يديك مخزن من المعلومات المفيدة الخام ، أو على الأقل الوصول إلى مصدره.


اليوم الفائز هو الذي يتخذ القرارات بناءً على بيانات موضوعية. أصبحت مهارات المحلل أكثر ملاءمة من أي وقت مضى ، وتوافر الأدوات اللازمة في متناول يدك يسمح لك دائمًا أن تكون متقدمًا بخطوة. هذه مساعدة في ظهور هذه المقالة.


هل لديك عملك الخاص؟ أو ربما ... على الرغم من ذلك ، لا يهم. عملية استخراج البيانات لا نهاية لها ومثيرة. وحتى مجرد البحث بشكل جيد على الإنترنت ، يمكنك العثور على مجال للنشاط.


إليك ما لدينا اليوم - قاعدة بيانات توزيع XML غير رسمية لـ RuTracker.ORG. يتم تحديث قاعدة البيانات كل ستة أشهر وتحتوي على معلومات حول جميع التوزيعات لتاريخ وجود هذا تعقب سيل.


ماذا يمكن أن تخبر أصحاب rutracker؟ والمتواطئون المباشرون للقرصنة على الإنترنت؟ أو مستخدم عادي مغرم أنيمي ، على سبيل المثال؟


هل تفهم ما أقصد؟


تنصل

. open source .


big data.


المكدس - R ، Clickhouse ، Dataiku


تمر أي تحليلات بعدة مراحل رئيسية: استخراج البيانات وإعدادها ودراسة البيانات (التصور). كل مرحلة لها أداتها الخاصة. لأن مكدس اليوم:


  1. ر. نعم ، لا يحظى بشعبية ويقل عن بايثون. ولكن لا تزال نظيفة وجميلة مع dplyr و ggplot2. ولد للتحليلات وعدم استخدامه جريمة.
  2. كليك هاوس. عمود DBMS التحليلي. سمعت بالتأكيد: "كليكهاوس لا يبطئ" أو "سرعة على حافة الخيال". الناس لا يكذبون وسنرى ذلك. المسؤول عن اللحظية.
  3. Dataiku. منصة لمعالجة وتصور وتحليل الأعمال التنبؤية.

مراجعة: يعمل Dataiku على Linux و Mac. يتوفر إصدار مجاني بحد أقصى للمستخدمين 3 أشخاص. الوثائق هنا .


والمثير للدهشة أنه لا يوجد حتى الآن ضجيج أو ضجيج ، إذا كنت ترغب ، في عدم قابلية هذه المنصة للمقاومة على موارد اللغة الروسية وحتى على حبري. سآخذ لتصحيح سوء الفهم هذا وأطلب منك تهنئة dataiku على المبادرة.


البيانات الضخمة - مشاكل كبيرة


xml– 5 . – rutracker.org, (2005 .) 2019 . 15 !


R Studio – ! . , .


, R. Big Data, Clickhouse … , xml–. . .


. Dataiku DSS . – 10 000 . . , . , 200 000 .


, . .


صورة


. : content — json.


content, . – .


recipe — . , . json .


صورة


. , , + dataiku.


recipe, — .


صورة


csv Clickhouse.



Clickhouse 15 rutracker-a.


?


SELECT ROUND(uniq(torrent_id) / 1000000, 2) AS Count_M
FROM rutracker

┌─Count_M─┐
│    1.46 │
└─────────┘
1 rows in set. Elapsed: 0.247 sec. Processed 25.51 million rows, 204.06 MB (103.47 million rows/s., 827.77 MB/s.)

1.5 25 . 0.3 ! .


, , .


SELECT COUNT(*) AS Count
FROM rutracker
WHERE (file_ext = 'epub') OR (file_ext = 'fb2') OR (file_ext = 'mobi')

┌──Count─┐
│ 333654 │
└────────┘
1 rows in set. Elapsed: 0.435 sec. Processed 25.51 million rows, 308.79 MB (58.64 million rows/s., 709.86 MB/s.)

300 — ! , . .


SELECT ROUND(SUM(file_size) / 1000000000, 2) AS Total_size_GB
FROM rutracker
WHERE (file_ext = 'epub') OR (file_ext = 'fb2') OR (file_ext = 'mobi')

┌─Total_size_GB─┐
│        625.75 │
└───────────────┘
1 rows in set. Elapsed: 0.296 sec. Processed 25.51 million rows, 344.32 MB (86.24 million rows/s., 1.16 GB/s.)

– 25 . , ?


R


R. , DBI ( ). Clickhouse.


R
library(DBI) #    , ... Clickhouse
library(dplyr) #   %>%

# 
library(ggplot2) 
library(ggrepel)
library(cowplot)
library(scales)
library(ggrepel)

#   localhost:9000 
connection <- dbConnect(RClickhouse::clickhouse(), host="localhost", port = 9000)

, . dplyr .


? rutracker.org .


R
years_stat <- dbGetQuery(connection,
                       "SELECT
                          round(COUNT(*)/1000000, 2) AS Files,
                          round(uniq(torrent_id)/1000, 2) AS Torrents,
                          toYear(torrent_registred_at) AS Year
                        FROM rutracker
                        GROUP BY Year")

ggplot(years_stat, aes(as.factor(Year), as.double(Files))) +
  geom_bar(stat = 'identity', fill = "darkblue", alpha = 0.8)+

  theme_minimal() +
  labs(title = "     RuTracker", subtitle = "  2005 - 2019\n")+

  theme(axis.text.x = element_text(angle=90, vjust = 0.5),
        axis.text.y = element_text(),

        axis.title.y = element_blank(),
        axis.title.x = element_blank(),

        panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(size = 0.9),
        panel.grid.minor.y = element_line(size = 0.4),

        plot.title = element_text(vjust = 3, hjust = 0, family = "sans", size = 16, color = "#101010", face = "bold"),
        plot.caption = element_text(vjust = 3, hjust = 0, family = "sans", size = 12, color = "#101010", face = "bold"),
        plot.margin = unit(c(1,0.5,1,0.5), "cm"))+

    scale_y_continuous(labels = number_format(accuracy = 1, suffix = " "))

ggplot(years_stat, aes(as.factor(Year), as.integer(Torrents))) +
  geom_bar(stat = 'identity', fill = "#008b8b", alpha = 0.8)+

   theme_minimal() +
   labs(title = "     RuTracker", subtitle = "  2005 - 2019\n", caption = "*  ")+

   theme(axis.text.x = element_text(angle=90, vjust = 0.5),
          axis.text.y = element_text(),

          axis.title.y = element_blank(),
          axis.title.x = element_blank(),

          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(size = 0.9),
          panel.grid.minor.y = element_line(size = 0.4),

          plot.title = element_text(vjust = 3, hjust = 0, family = "sans", size = 16, color = "#101010", face = "bold"),
          plot.caption = element_text(vjust = -3, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),
          plot.margin = unit(c(1,0.5,1,0.5), "cm")) +

     scale_y_continuous(labels = number_format(accuracy = 1, suffix = " "))

صورةصورة


2016 . , 2016 rutracker.org . , .


, . , .


.


R
extention_stat <- dbGetQuery(connection,
       "SELECT toYear(torrent_registred_at) AS Year,
              COUNT(tracker_id)/1000 AS Count,
              ROUND(SUM(file_size)/1000000000000, 2) AS Total_Size_TB,
              file_ext
         FROM rutracker
         GROUP BY Year, file_ext
         ORDER BY Year, Count")

#       
TopExt <- function(x, n) {
  res_tab <- NULL
  # 2005  2006, ..   
  for (i in (3:15)) {
    res_tab <-bind_rows(list(res_tab,
          extention_stat %>% filter(Year == x[i]) %>%
          arrange(desc(Count), desc(Total_Size_TB)) %>%
          head(n)
      ))
  }
  return(res_tab)
}

years_list <- unique(extention_stat$Year)
ext_data <- TopExt(years_list, 5)

ggplot(ext_data, aes(as.factor(Year), as.integer(Count),  fill = file_ext)) +
  geom_bar(stat = "identity",position="dodge2", alpha =0.8, width = 1)+

  theme_minimal() +
  labs(title = "     RuTracker", 
          subtitle = "  2005 - 2019\n", 
          caption = "* -5   ", fill = "") +

   theme(axis.text.x = element_text(angle=90, vjust = 0.5),
          axis.text.y = element_text(),

          axis.title.y = element_blank(),
          axis.title.x = element_blank(),

          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(size = 0.9),
          panel.grid.minor.y = element_line(size = 0.4),

          legend.title = element_text(vjust = 1, hjust = -1, family = "sans", size = 9, color = "#101010", face = "plain"),
          legend.position = "top",

          plot.title = element_text(vjust = 3, hjust = 0, family = "sans", size = 16, color = "#101010", face = "bold"),
          plot.caption = element_text(vjust = -4, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),
          plot.margin = unit(c(1,0.5,1,0.5), "cm")) +

     scale_y_continuous(labels = number_format(accuracy = 0.5, scale = (1/1000), suffix = " "))+guides(fill=guide_legend(nrow=1))

صورة


. . .


rutracker-a. .


R
chapter_stat <- dbGetQuery(connection, 
      "SELECT 
             substring(forum_name, 1, position(forum_name, ' -')) Chapter, 
             uniq(torrent_id) AS Count, 
             ROUND(median(file_size)/1000000, 2) AS Median_Size_MB, 
             ROUND(max(file_size)/1000000000) AS Max_Size_GB, 
             ROUND(SUM(file_size)/1000000000000) AS Total_Size_TB 
        FROM rutracker WHERE Chapter NOT LIKE('\"%') 
        GROUP BY Chapter 
        ORDER BY Count DESC")

  chapter_stat$Count <- as.integer(chapter_stat$Count)

#     
AggChapter2 <- function(Chapter){
  var_ch <- str(Chapter)
  res = NULL
  for(i in (1:22)){
    select_str <-paste0(
    "SELECT 
           toYear(torrent_registred_at) AS Year, 
           substring(forum_name, 1, position(forum_name, ' -')) Chapter, 
           uniq(torrent_id)/1000 AS Count, 
           ROUND(median(file_size)/1000000, 2) AS Median_Size_MB, 
           ROUND(max(file_size)/1000000000,2) AS Max_Size_GB, 
           ROUND(SUM(file_size)/1000000000000,2) AS Total_Size_TB 
      FROM rutracker 
      WHERE Chapter LIKE('", Chapter[i], "%') 
      GROUP BY Year, Chapter 
      ORDER BY Year")
    res <-bind_rows(list(res, dbGetQuery(connection, select_str)))
                  }
  return(res)
}

chapters_data <- AggChapter2(chapter_stat$Chapter)

chapters_data$Chapter <- as.factor(chapters_data$Chapter)
chapters_data$Count <- as.numeric(chapters_data$Count)

chapters_data %>% group_by(Chapter)%>% 

ggplot(mapping = aes(x = reorder(Chapter, Total_Size_TB), y = Total_Size_TB))+
geom_bar(stat = "identity", fill="darkblue", alpha =0.8)+

  theme(panel.grid.major.x = element_line(colour="grey60", linetype="dashed"))+
  xlab('\n') + theme_minimal() +

  labs(title = "C   RuTracker-", 
          subtitle = "  2019\n")+
  theme(axis.text.x = element_text(),
       axis.text.y = element_text(family = "sans", size = 9, color = "#101010", hjust = 1, vjust = 0.5),

       axis.title.y = element_text(vjust = 2.5, hjust = 0, family = "sans", size = 9, color = "grey40", face = "plain"),
       axis.title.x = element_blank(),

       axis.line.x  = element_line(color = "grey60", size = 0.1, linetype = "solid"),

       panel.grid.major.y = element_blank(),
       panel.grid.major.x = element_line(size = 0.7, linetype = "solid"),
       panel.grid.minor.x = element_line(size = 0.4, linetype = "solid"),

       plot.title = element_text(vjust = 3, hjust = 1, family = "sans", size = 16, color = "#101010", face = "bold"),
       plot.subtitle  = element_text(vjust = 2, hjust = 1, family = "sans", size = 12, color = "#101010", face = "plain"),
       plot.caption = element_text(vjust = -3, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),

       plot.margin = unit(c(1,0.5,1,0.5), "cm"))+
   scale_y_continuous(labels = number_format(accuracy = 1, suffix = " "))+
   coord_flip()

صورة


. — — . , . , Apple.


R
chapters_data %>% group_by(Chapter)%>% 

ggplot(mapping = aes(x = reorder(Chapter, Count), y = Count))+
   geom_bar(stat = "identity", fill="#008b8b", alpha =0.8)+

   theme(panel.grid.major.x = element_line(colour="grey60", linetype="dashed"))+
   xlab('') + theme_minimal() +
   labs(title = "    RuTracker-", 
           subtitle = "  2019\n")+
   theme(axis.text.x = element_text(),
       axis.text.y = element_text(family = "sans", size = 9, color = "#101010", hjust = 1, vjust = 0.5),

       axis.title.y = element_text(vjust = 3.5, hjust = 0, family = "sans", size = 9, color = "grey40", face = "plain"),
       axis.title.x = element_blank(),

       axis.line.x  = element_line(color = "grey60", size = 0.1, linetype = "solid"),

       panel.grid.major.y = element_blank(),
       panel.grid.major.x = element_line(size = 0.7, linetype = "solid"),
       panel.grid.minor.x = element_line(size = 0.4, linetype = "solid"),

       plot.title = element_text(vjust = 3, hjust = 1, family = "sans", size = 16, color = "#101010", face = "bold"),
       plot.subtitle  = element_text(vjust = 2, hjust = 1, family = "sans", size = 12, color = "#101010", face = "plain"),
       plot.caption = element_text(vjust = -3, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),

       plot.margin = unit(c(1,0.5,1,0.5), "cm"))+
    scale_y_continuous(limits = c(0, 300), labels = number_format(accuracy = 1, suffix = " "))+
    coord_flip()

صورة


, , : -.
~15 .


R
library("RColorBrewer")
getPalette = colorRampPalette(brewer.pal(19, "Spectral"))

chapters_data %>% #filter(Chapter %in% chapter_stat$Chapter[c(4,6,7,9:20)])%>%
  filter(!Chapter %in% chapter_stat$Chapter[c(16, 21, 22)])%>%
  filter(Year>=2007)%>%

ggplot(mapping = aes(x = Year, y = Count, fill = as.factor(Chapter)))+
   geom_area(alpha =0.8, position = "fill")+

   theme_minimal() +
   labs(title = "   -", 
           subtitle = " ~15 ", fill = "")+
   theme(axis.text.x = element_text(vjust = 0.5),
          axis.text.y = element_blank(),

          axis.title.y = element_blank(),
          axis.title.x = element_blank(),

          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(size = 0.9),
          panel.grid.minor.y = element_line(size = 0.4),

          plot.title = element_text(vjust = 3, hjust = 0, family = "sans", size = 16, color = "#101010", face = "bold"),
          plot.caption = element_text(vjust = -3, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),
          plot.margin = unit(c(1,1,1,1), "cm")) +

     scale_x_continuous(breaks = c(2008, 2010, 2012, 2014, 2016, 2018),expand=c(0,0)) +
     scale_fill_manual(values = getPalette(19))

صورة


- — . — Apple , .
. .


, Rutracker-a. - rutracker.org.


R
unique_torr_per_day <- dbGetQuery(connection, 
          "SELECT toDate(torrent_registred_at) AS date, 
                          uniq(torrent_id) AS count
           FROM rutracker 
           GROUP BY date
           ORDER BY date")

unique_torr_per_day %>% 
ggplot(aes(format(date, "%Y"), format(date, "%j"), fill = as.numeric(count)))+
  geom_tile() +

  theme_minimal() +
  labs(title = "   RuTracker-a", 
          subtitle = " ~15 \n\n", 
          fill = "-   \n")+
      theme(axis.text.x = element_text(vjust = 0.5),
          axis.text.y = element_text(),

          axis.title.y = element_blank(),
          axis.title.x = element_blank(),

          panel.grid.major.y = element_blank(),
          panel.grid.major.x = element_line(size = 0.9),
          panel.grid.minor.x = element_line(size = 0.4),

          legend.title = element_text(vjust = 0.7, hjust = -1, family = "sans", size = 10, color = "#101010", face = "plain"),
          legend.position = c(0.88, 1.30),
          legend.direction = "horizontal",

          plot.title = element_text(vjust = 3, hjust = 0, family = "sans", size = 16, color = "#101010", face = "bold"),
          plot.caption = element_text(vjust = -3, hjust = 1, family = "sans", size = 9, color = "grey60", face = "plain"),
          plot.margin = unit(c(1,1,1,1), "cm"))+ coord_flip(clip = "off") +
          scale_y_discrete(breaks = c(format(as.Date("2007-01-15"), "%j"), 
                                      format(as.Date("2007-02-15"), "%j"), 
                                      format(as.Date("2007-03-15"), "%j"), 
                                      format(as.Date("2007-04-15"), "%j"), 
                                      format(as.Date("2007-05-15"), "%j"), 
                                      format(as.Date("2007-06-15"), "%j"), 
                                      format(as.Date("2007-07-15"), "%j"),
                                      format(as.Date("2007-08-15"), "%j"),
                                      format(as.Date("2007-09-15"), "%j"),
                                      format(as.Date("2007-10-15"), "%j"),
                                      format(as.Date("2007-11-15"), "%j"),
                                      format(as.Date("2007-12-15"), "%j")), 
          labels = c("", "", "", "", "", "","", "", "", "","",""), position = 'right') +
          scale_fill_gradientn(colours = c("#155220", "#c6e48b"))  + 

       annotate(geom = "curve", x = 16.5, y = 119, xend = 13, yend = 135, 
                   curvature = .3, color = "grey15", arrow = arrow(length = unit(2, "mm"))) +
       annotate(geom = "text", x = 16, y = 45, 
label = "     « » \n", 
hjust = "left", vjust = -0.75, color = "grey25") + 

       guides(x.sec = guide_axis_label_trans(~.x)) + 
       annotate("rect", xmin = 11.5, xmax = 12.5, ymin = 1, ymax = 366,
                       alpha = .0, colour = "white", size = 0.1) + 
       geom_segment(aes(x = 11.5, y = 25, xend = 12.5, yend = 25, colour = "segment"), 
                                  show.legend = FALSE)

صورة


2017 . (. GitHub ). 2016 , . .


. . – .
, content , , , 15 .


Dataiku


, : , , , .


صورة


, -. . – .


صورة


– .


صورة


: rutracker.org , , — 60. 2009 — 2014 .


. , , . .


, . .


صورة


, dataiku — . , , (R, Python), . .


, RuTracker, : , . . , . .


UPD: , recipe dataiku.


بشكل مشروط ، يمكن تقسيم الوصفة الواردة في هذه المقالة إلى قسمين: إعداد البيانات للتحليل في R وإعداد البيانات حول الأنيمي للتحليل مباشرة على النظام الأساسي.


مرحلة التحضير للتحليل في R

json- .


image


json-

. .


image


timestamp .


image


مرحلة إعداد بيانات الأنمي

, , . content — Descr_Data.


image


content

regexp , , , . , regexp dataiku .


image




All Articles