Mining DistroWatch.com Logs (Part 1)
Mining the logs of the famous DistroWatch.com website makes it possible to formally assess trends in the GNU/Linux ecosystem. In particular, this first part analyzes the popularity of Ubuntu with respect to the former predominance of Mandriva.
A month from now, an algorithm called Data-Peeler will be presented at a prestigious conference in Atlanta, Georgia. This algorithm, developed by the French research team I work in, belongs to the field of data mining, the computer science topic that aims at extracting knowledge from data. In the case of Data-Peeler, the data are n-ary relations. The extracted knowledge binds subsets of the n domains that are simultaneously frequent.
Hence, once I had implemented this algorithm, I needed an interesting real-life n-ary relation to assess the relevance of the extracted knowledge. I immediately thought of DistroWatch.com. This popular website gathers comprehensive information about GNU/Linux, BSD and Solaris distributions. Interpreting trends in Free operating systems may look less serious than the gene expression data analysis my team is accustomed to. However, I feel much more comfortable with it! That is why I wrote to Ladislav Bodnar, the maintainer of DistroWatch.com. He kindly agreed to share his logs with me and wished me "Happy number crunching!".
On DistroWatch.com, every distribution is described on a separate page. I have considered that a visitor loading such a page is "interested" in the distribution. Ladislav analyzes every IP address contacting his server, so the country the connection comes from is logged as well. Finally, timestamps make it possible to study how the interest in the different distributions evolves over time. The result is a wonderful ternary relation between distributions, countries and time. From it, Data-Peeler extracts patterns binding sets of distributions with sets of countries and sets of time periods. Their meaning is the following: some users from those countries were interested in the same set of distributions during the same time periods.
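To make the meaning of such a pattern concrete, here is a minimal Python sketch. It is not Data-Peeler itself, and it does not search for patterns: it only builds a small ternary relation and checks whether a given candidate pattern holds in it. The log records, the country codes, the choice of ISO weeks as time periods and the function names (`week_of`, `build_relation`, `holds`) are all assumptions made for illustration. The real DistroWatch.com log format is not described here.

```python
from datetime import datetime
from itertools import product

# Hypothetical log records: (distribution page, visitor country, timestamp).
# The values are invented for illustration only.
records = [
    ("ubuntu", "FR", "2008-03-03 10:15:00"),
    ("ubuntu", "DE", "2008-03-04 18:40:00"),
    ("mandriva", "FR", "2008-03-05 09:05:00"),
    ("mandriva", "DE", "2008-03-06 21:30:00"),
    ("ubuntu", "FR", "2008-03-11 11:00:00"),
    ("mandriva", "FR", "2008-03-12 14:20:00"),
    ("ubuntu", "DE", "2008-03-13 08:45:00"),
]


def week_of(timestamp):
    """Map a timestamp string to an ISO (year, week) pair (assumed period size)."""
    iso = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").isocalendar()
    return (iso[0], iso[1])


def build_relation(records):
    """Build the ternary relation as a set of (distribution, country, period) triples."""
    return {(dist, country, week_of(ts)) for dist, country, ts in records}


def holds(relation, distributions, countries, periods):
    """Return True if every combination of the three sets is in the relation."""
    return all(
        triple in relation
        for triple in product(distributions, countries, periods)
    )


relation = build_relation(records)
first_week = week_of("2008-03-03 00:00:00")
second_week = week_of("2008-03-10 00:00:00")

# Both distributions were visited from both countries during the first week.
print(holds(relation, {"ubuntu", "mandriva"}, {"FR", "DE"}, {first_week}))
# No visit to the Mandriva page from DE in the second week, so this fails.
print(holds(relation, {"ubuntu", "mandriva"}, {"FR", "DE"}, {first_week, second_week}))
```

In this sketch a pattern holds only if every (distribution, country, period) combination appears in the relation. Enumerating all such patterns efficiently is the hard part, and that is the job of the algorithm, not of this check.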

