Tuesday, April 26, 2011

Taming WikiPedia Category Hierarchy


Taming WikiPedia Category Hierarchy

Wikipedia's category system can be thought of as consisting of overlapping trees. Any category may branch into subcategories, and it is possible for a category to be a subcategory of more than one parent. (A is said to be a parent category of B when B is a subcategory of A).

Some pointers on categorization:
  • A category can be present in more than one parent category.
  • Cycles are present in the wiki graph like A --> B --> C --> D --> A
  • Every wikipedia article should belong to at-least 1 category.
  • For any category there exist multiple paths from Root to that category.
  • Wiki articles are placed under these overlapping trees.

[] Articles
[+] Articles with resourced statements
[+] Featured articles
[+] Fundamental categories
[+] Good articles
[+] Lists
[+] Main topic classifications
[+] Spoken articles
[+] Timelines


  • A snapshot of a wiki graph













Problem Statement
For a given category need to find its root parent category/categories among the following
categories(under Main Topic Classifications)

Main topic classifications
[+] Agriculture
[+] Applied sciences
[+] Arts
[+] Belief
[+] Business
[+] Chronology
[+] Culture
[+] Education
[+] Environment
[+] Geography
[+] Health
[+] History
[+] Humanities
[+] Language
[+] Law
[+] Life
[+] Mathematics
[+] Nature
[+] People
[+] Politics
[+] Science
[+] Society
[+] Technology


Our Approach
We approach the given problem considering following points

We need to remove cycles from the loop, otherwise it will become
an infinite loop
We need to avoid multiple paths to a category from any
root categories.
We need to remove sibling paths as they also increase
complexity and then too lead to the same parent.
Need to create a graph which points only downwards not vice
versa and without cycles (as given in the image)
Sub categories should fall under sub levels , so that this graph
can be visualized as a multi-root Tree.
























We follow the graph colouring principles from Red-Black Tree and traverse our graph in
Breadth First manner with following rules:











  • All processed nodes are coloured in Black.
  • All leafs in process are coloured Grey.
  • Edge is possible only from Black --> Grey node not vice versa.
  • All the leafs of a level are coloured Black after processing.

Our Observation


Out of 1.3 million unique categories, we covered around 532,000
categories with 745,000 edges in our graph. The created wikigraph
has 16 level hierarchy. To persist this graph we are using OrientDB.
On a machine with 4 core - 16 GB ram the graph can be created
in T<25 mins. Now we can identify with root categories for a given
category by graph connectivity algorithms. OrientDb also provides
queries for the same, hence we are using the queries to find connection
of a category with the root nodes.