Map Of The Internet
Hello everyone!
I want to introduce you Map or clustering of more than 350 thousand sites in accordance with the transitions of users between them. The size of the circle is determined by website traffic, color – nationality and position on the map – it links to other sites. If the two sites have a steady flow of users between them, they will "try" to stay closer to each other. After completion of the algorithm, the map can observe clusters of sites (clusters) with common users.
For example, if you enter in a search habrahabr.ru you can see that dirty.ru and leprosorium.ru in the same "constellation", and away livejournal.ru. This suggests that those who are reading this text, also with high probability visits these sites (relative to average user of the Russian Internet of course).
An even more interesting example of clustering can be seen at the bottom of the card, between the purple and yellow Japan Brazil: there are pornostranka size comparable to the whole Euroatom. Interestingly, being quite competent on the issue under consideration, within a large pornocasera is possible to distinguish thematic-clusters of a smaller size.
For those interested in a brief technical description – welcome under kat
The whole project is written in C# and consists of three parts: program clustering, the program will generate the map tiles and web site. Each part deserves a separate consideration and, if there is interest, I could then tell more about them.
The original data was taken from Alexa's, they represent a record of attendance, upstream and downstream transitions users (upstream – where they came from, downstream is where left). After normalization, we obtain a weighted undirected graph with 350 thousands of vertices and 2 million edges.
The calculation of such a graph is a challenging computational task and, therefore, was written a special module for the GPU, but luckily he didn't. After a tricky optimizations, all calculations took several weeks of continuous operation powerful, but still a household iron.
In simple terms, the algorithm was step-by-step moving sites on the map in accordance with the forces acting on them. A lot of user clicks – strong force tries to bring together the sites; if the sites are located too close, the repulsive force, etc. More details can be found in the work reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-189.pdf
The main problem was the enormous computational complexity of such algorithm. After the solution of the problem "in a forehead", at each step need to compute the superposition of the forces for each site, i.e. to calculate the forces for each pair of sites, and such pairs of about 122 billion (not bad for one step, right?). Was therefore carried out hard optimization algorithm and the complete parallelization of computations. Fortunately the platform .net was surprisingly good for this kind of fun.
The second step is generation of the tiles. Tile is a small image 256 x 256 pixels of the map image for google maps, yandex and other services. In General, nothing complicated – I generate a big picture, cut it into squares of desired size. But these images were almost 30 million Even simply copy or remove the directory with the tiles in windows takes two days. And pour them into hosting a separate issue.
The third stage is to install the google maps engine, to put it all together and make it show on the map. Here in General there is nothing complicated, although there were some difficulties with the projection and positioning of the card.
The last stage is the choice of hosting and release. Here too has not done without adventures. Now everything is in the cloud Amazon and it was much easier and cheaper than I imagined.
In General, I have accumulated some experience, and I'll be glad to share with dear community. Of course, in General, nothing special, on habré there are some really interesting projects and non-trivial solutions, but still, I think many may be interested.
Also I'll be glad to any ideas, feedback, criticism – any feedback!
Thank you all for your kind words! For me it was very important to get support from a professional community of people who not only see the result, but I understand how much work has been done to achieve it. Also thanks Habru for the fact that in Runet there is a platform which unites all of us!
Continued: Web architecture Internet Card
Article based on information from habrahabr.ru
I want to introduce you Map or clustering of more than 350 thousand sites in accordance with the transitions of users between them. The size of the circle is determined by website traffic, color – nationality and position on the map – it links to other sites. If the two sites have a steady flow of users between them, they will "try" to stay closer to each other. After completion of the algorithm, the map can observe clusters of sites (clusters) with common users.
For example, if you enter in a search habrahabr.ru you can see that dirty.ru and leprosorium.ru in the same "constellation", and away livejournal.ru. This suggests that those who are reading this text, also with high probability visits these sites (relative to average user of the Russian Internet of course).
An even more interesting example of clustering can be seen at the bottom of the card, between the purple and yellow Japan Brazil: there are pornostranka size comparable to the whole Euroatom. Interestingly, being quite competent on the issue under consideration, within a large pornocasera is possible to distinguish thematic-clusters of a smaller size.
For those interested in a brief technical description – welcome under kat
Engineering
The whole project is written in C# and consists of three parts: program clustering, the program will generate the map tiles and web site. Each part deserves a separate consideration and, if there is interest, I could then tell more about them.
The original data was taken from Alexa's, they represent a record of attendance, upstream and downstream transitions users (upstream – where they came from, downstream is where left). After normalization, we obtain a weighted undirected graph with 350 thousands of vertices and 2 million edges.
The calculation of such a graph is a challenging computational task and, therefore, was written a special module for the GPU, but luckily he didn't. After a tricky optimizations, all calculations took several weeks of continuous operation powerful, but still a household iron.
In simple terms, the algorithm was step-by-step moving sites on the map in accordance with the forces acting on them. A lot of user clicks – strong force tries to bring together the sites; if the sites are located too close, the repulsive force, etc. More details can be found in the work reports-archive.adm.cs.cmu.edu/anon/1998/CMU-CS-98-189.pdf
The main problem was the enormous computational complexity of such algorithm. After the solution of the problem "in a forehead", at each step need to compute the superposition of the forces for each site, i.e. to calculate the forces for each pair of sites, and such pairs of about 122 billion (not bad for one step, right?). Was therefore carried out hard optimization algorithm and the complete parallelization of computations. Fortunately the platform .net was surprisingly good for this kind of fun.
The second step is generation of the tiles. Tile is a small image 256 x 256 pixels of the map image for google maps, yandex and other services. In General, nothing complicated – I generate a big picture, cut it into squares of desired size. But these images were almost 30 million Even simply copy or remove the directory with the tiles in windows takes two days. And pour them into hosting a separate issue.
The third stage is to install the google maps engine, to put it all together and make it show on the map. Here in General there is nothing complicated, although there were some difficulties with the projection and positioning of the card.
The last stage is the choice of hosting and release. Here too has not done without adventures. Now everything is in the cloud Amazon and it was much easier and cheaper than I imagined.
In General, I have accumulated some experience, and I'll be glad to share with dear community. Of course, in General, nothing special, on habré there are some really interesting projects and non-trivial solutions, but still, I think many may be interested.
Also I'll be glad to any ideas, feedback, criticism – any feedback!
Thank you all for your kind words! For me it was very important to get support from a professional community of people who not only see the result, but I understand how much work has been done to achieve it. Also thanks Habru for the fact that in Runet there is a platform which unites all of us!
Continued: Web architecture Internet Card