Hands On: NELL — Never-Ending Language Learner


“One of the great technical challenges in big data is to construct computer systems that learn continuously over years, from a continuing stream of diverse data, improving their competence at a variety of tasks, and becoming better learners over time.”
This quote introduced a recent ACM webinar on Carnegie Mellon University’s Never-Ending Language Learner (NELL), a machine learning system that runs 24 hours a day learning to read the web. Each day NELL extracts, or “reads,” facts from the web and integrates them into its growing network of beliefs. NELL is, in reality, software running on a cluster of computers at Carnegie Mellon University. NELL attracted significant attention in 2010 as the result of a New York Times article, “Aiming To Learn As We Do, A Machine Teaches Itself.”

NELL uses a coupled semi-supervised bootstrapping approach to learn new facts from the texts it analyzes. Starting with an initial ontology and a small number of “seeds” for each ontology category, NELL can independently learn new categories and relations.

NELL has been running 24 hours a day for over three years, in essentially continuous operation since January 2010. For the first six months NELL ran without any human supervision, learning to extract instances of a few hundred categories and relations and building a knowledge base of approximately a third of a million extracted instances of those categories and relations. At that point, NELL had learned to read three quarters of the categories and relations with precision in the range of 90% to 99%, but it had become inaccurate at extracting instances of the remaining quarter of the ontology. Human supervision was then added to the process to improve precision.

The result is a collection of 50 million interconnected beliefs that NELL is considering, along with hundreds of thousands of learned phrasings, morphological features, and web page structures.
You can track NELL’s progress at http://rtw.ml.cmu.edu and learn more about NELL by watching this video featuring Professor Tom Mitchell, who leads the team that developed NELL.

For a shorter, more accessible introduction, check out this presentation:

Meet NELL. See NELL Run, Teach NELL How To Run (Demo, TCTV)

Currently the community interacts with NELL via a very simple text interface. However, Mitchell has developed some novel alternative ideas, including a game-like interface called Polarity.
For now you can help NELL learn via Twitter @cmunell or try asking NELL. You can browse NELL’s current categories and relations from the Ask NELL page as well.


I do some work analyzing texts, and from time to time I would like access to a ready-made ontology, or knowledge about items that might appear in the texts my software examines. NELL provides an easy-to-use JSON query interface that will be familiar to most web application programmers.

OK, a couple of up-front warnings: the NELL JSON interface is both experimental and a bit unusual. The NELL JSON page states:

“NOTE: This API is still largely experimental and can be expected to change as both the project and usage patterns evolve.”

“One source of complexity for this API comes from the fact that the arguments given as queries or returned as results may be of one of three types: concepts, tokens, or literals.”

Plus the database seems pretty incomplete when you start poking at it.
NELL queries are either category instance queries or relation instance queries.

Here’s a simple example of what you can do with NELL. Say we want to identify articles that might be about competitors to a given company. NELL knows about companies and their relationships. The following short Mathematica code snippet queries NELL, captures everything NELL knows about Google, and then parses out instances of “competeswith” to find Google’s competitors.

qNELL = Import[nellQueryUrl, "Text"]; (* nellQueryUrl: the NELL JSON query URL for Google; the URL was elided in the original post *)
facts = StringCases[qNELL,
   Shortest["\"ent1\" : " ~~ e1___ ~~ ", \"predicate\" : " ~~ p___ ~~ ", \"ent2\" : " ~~ e2___ ~~ ","] -> {e1, p, e2}];
Select[facts, #[[2]] == "\"competeswith\"" &] // TableForm

The results are shown below:

"concept:website:google" "competeswith" "concept:biotechcompany:news_corp"
"concept:website:google" "competeswith" "concept:blog:digg"
"concept:website:google" "competeswith" "concept:company:paypal"
"concept:website:google" "competeswith" "concept:company:yahoo_sponsored_search"
"concept:website:google" "competeswith" "concept:magazine:apple"
"concept:website:google" "competeswith" "concept:company:yahoo_overture"
"concept:website:google" "competeswith" "concept:company:microsoft_corporation"
"concept:website:google" "competeswith" "concept:company:netscape"
"concept:website:google" "competeswith" "concept:company:qualcomm_inc"
"concept:website:google" "competeswith" "concept:company:live"
"concept:website:google" "competeswith" "concept:company:print"
"concept:website:google" "competeswith" "concept:company:skype"
"concept:website:google" "competeswith" "concept:company:garmin"
"concept:website:google" "competeswith" "concept:company:map"
"concept:website:google" "competeswith" "concept:company:firefox"
"concept:website:google" "competeswith" "concept:website:myspace"
"concept:website:google" "competeswith" "concept:website:msn_"
"concept:website:google" "competeswith" "concept:website:technorati"
"concept:website:google" "competeswith" "concept:website:yahoo"
"concept:website:google" "competeswith" "concept:company:rss"
"concept:website:google" "competeswith" "concept:company:mobile"
"concept:website:google" "competeswith" "concept:website:msn"
"concept:website:google" "competeswith" "concept:website:facebook"
"concept:website:google" "competeswith" "concept:website:aol"
"concept:website:google" "competeswith" "concept:company:amazon"
"concept:website:google" "competeswith" "concept:website:wikipedia_wikiproject_lists"
"concept:website:google" "competeswith" "concept:company:flickr001"
"concept:website:google" "competeswith" "concept:company:twitter"
"concept:website:google" "competeswith" "concept:company:local"
"concept:website:google" "competeswith" "concept:company:video001"
"concept:website:google" "competeswith" "concept:website:live_search"
"concept:website:google" "competeswith" "concept:website:bloglines"
"concept:website:google" "competeswith" "concept:website:yahoo_search"
"concept:website:google" "competeswith" "concept:website:msn_search"
"concept:website:google" "competeswith" "concept:company:sun"
"concept:website:google" "competeswith" "concept:website:microsoft_live"
"concept:website:google" "competeswith" "concept:company:japan"
"concept:website:google" "competeswith" "concept:company:oracle"

Some of the results are amusing, but they are not entirely bad. If we run the same query for Apple Computer we get:

"concept:magazine:apple" "competeswith" "concept:publication:sunday_business_post"
"concept:magazine:apple" "competeswith" "concept:company:dell"
"concept:company:apple" "competeswith" "concept:company:sun"
"concept:organization:apple_inc" "competeswith" "concept:company:china_mobile"
"concept:organization:apple_inc" "competeswith" "concept:biotechcompany:inc"
"concept:organization:apple_inc" "competeswith" "concept:company:dell"
"concept:organization:apple_inc" "competeswith" "concept:website:google"

Obviously NELL isn’t up on the latest mobile industry news, because querying for “Samsung” produces no results at all. Hmm.
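For those who would rather not use Mathematica, the same filtering step can be sketched in Python. The response shape here (a list of fact objects carrying "ent1", "predicate", and "ent2" keys) is an assumption inferred from the string pattern matched in the snippet above; since the API is experimental, check the actual payload before relying on it.

```python
import json

# Hypothetical NELL JSON response, shaped per the ent1/predicate/ent2
# keys matched by the Mathematica pattern above (an assumption).
response = json.loads("""
[
  {"ent1": "concept:website:google", "predicate": "competeswith", "ent2": "concept:website:yahoo"},
  {"ent1": "concept:website:google", "predicate": "generalizations", "ent2": "concept:website"},
  {"ent1": "concept:website:google", "predicate": "competeswith", "ent2": "concept:company:amazon"}
]
""")

# Keep only the "competeswith" facts and collect the competitor concepts.
competitors = [fact["ent2"] for fact in response
               if fact["predicate"] == "competeswith"]
print(competitors)
```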

The NELL database seems to be skewed by the people who have worked on it and interacted with NELL online. NELL knows that “death” could refer to a musical genre that is a type of “metal,” for example, but it doesn’t really have a concept of death as the ending of life.

You can learn some interesting things from querying NELL, though. For example, NELL thinks the entity “concept:monarch:sergey_brin” is “referredToByToken” “sergey_brin”, based on observations of the literal strings “Sergey Brin”, “sergey brin”, and “Sergey-Brin” in observed texts. I’m not sure why NELL labels Sergey a “monarch,” but by looking for these literal strings in novel texts we can quickly label them with possible NELL concept categories. NELL’s learned knowledge is ultimately stored in terms of these abstract concepts, and there is a many-to-many mapping from concepts to the actual literal noun phrase strings read from text. This allows NELL to capture polysemy, since a single word or phrase can have multiple meanings. In actuality, NELL has a third, intermediate layer of “tokens,” which are case-insensitive and punctuation-insensitive versions of a literal string. This three-level architecture (concept, token, literal) will be familiar to anyone working in the field.
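The token layer described above can be illustrated with a small normalization function. This is a sketch of the idea, not NELL’s actual normalization code: it simply lowercases a literal string and collapses punctuation and whitespace to underscores, so the three Sergey Brin variants all map to one token.

```python
import re

def to_token(literal: str) -> str:
    """Collapse a literal string to a case- and punctuation-insensitive
    token, mimicking NELL's intermediate token layer (a sketch only)."""
    return re.sub(r"[^a-z0-9]+", "_", literal.lower()).strip("_")

# All three literal forms observed in text collapse to one token:
assert to_token("Sergey Brin") == "sergey_brin"
assert to_token("sergey brin") == "sergey_brin"
assert to_token("Sergey-Brin") == "sergey_brin"
```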

Semantically Annotated Noun Phrases from NELL

If you are working with texts, perhaps the most interesting things available from NELL right now are the collections of annotated noun phrases and semantic categories. The ClueWeb09 Dataset consists of about 1 billion web pages in ten languages, collected in January and February 2009. NELL extracted 13.4 million noun phrases from the ClueWeb09 text corpus. Other noun phrase files available include:
  • Labels for 11 million noun phrases containing up to three words, extracted automatically with fairly good accuracy from the KBP 2012 source document collection
  • Labels for 10.4 million noun phrases containing four or more words, extracted automatically from the KBP 2012 source document collection
  • Labels for 6.5 thousand noun phrases extracted from the training and evaluation annotations made available for the KBP 2012 track; this set of labelings includes the entity IDs from the annotations for easy cross-referencing
  • Labels for 13 million noun phrases containing up to three words, extracted automatically with fairly good accuracy from the KBP 2013 source document collection
  • Labels for 13.5 million noun phrases containing four or more words, extracted automatically from the KBP 2013 source document collection

The annotated noun phrase files list the noun phrases, one per line, along with NELL’s category assignments and their confidence scores. Each line begins with a case-sensitive noun phrase, followed by a tab, followed by a list of one or more tab-separated category-confidence pairs. The category-confidence pairs are listed from highest to lowest confidence; the scores range from 0.5 to 1.0 and are not calibrated probabilities. These files can be used to assign conceptual categories to text fragments and extracted phrases.
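A parser for that line format is only a few lines of Python. The layout below follows the description above (phrase, tab, then tab-separated category-confidence pairs); the assumption that category and confidence are separated by a single space within each pair should be checked against the file’s own README.

```python
def parse_np_line(line):
    """Parse one annotated noun phrase line: a case-sensitive phrase
    followed by tab-separated "category confidence" pairs (the
    space-separated pair format is an assumption)."""
    fields = line.rstrip("\n").split("\t")
    phrase = fields[0]
    pairs = []
    for field in fields[1:]:
        category, confidence = field.rsplit(" ", 1)
        pairs.append((category, float(confidence)))
    return phrase, pairs

# Hypothetical example line:
phrase, cats = parse_np_line("Peyton Manning\tathlete 0.98\tcoach 0.61")
print(phrase, cats)
```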

NELL’s semantic categories are available as a simple CSV file that contains information about the hierarchical relationships between knowledge categories. For example, NELL knows that:

magazine ISA mediacompany, organization, humanagent, company, agent, publication
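Once the CSV is loaded into a parent map, walking such ISA relationships transitively is a simple graph traversal. The hierarchy dictionary below is a hypothetical, simplified fragment for illustration; the actual column layout of NELL’s file may differ.

```python
# Hypothetical fragment of the category hierarchy: each category maps
# to its direct generalizations (a simplification for illustration).
hierarchy = {
    "magazine": ["publication"],
    "publication": ["company"],
    "company": ["organization"],
    "organization": ["humanagent"],
    "humanagent": ["agent"],
    "agent": [],
}

def ancestors(category):
    """Return every category that `category` ISA, transitively."""
    seen = []
    stack = list(hierarchy.get(category, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(hierarchy.get(parent, []))
    return seen

print(ancestors("magazine"))
```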

All of this data is made freely available by the NELL research project to anybody who would like to use it for any legal purpose whatsoever, so both academic and commercial researchers are free to use it. Nice!


“Every Belief in The KB” File

 Another resource available from the NELL project is the “Every Belief in The KB” file. This file contains all of the category and relation instances that NELL believes to be true. Each belief is nominally an (Entity, Relation, Value) triple; instances of relations have the form (George Harrison, playsInstrument, Guitar), instances of categories have the form (Guitar, generalizations, musicalInstrument), and each belief occupies one line in the file.
However, caution is required when using this data, since concept names can be misleading:

Concepts have names like “concept:coach:peyton_manning”. These names often capture the right meaning, but they can be misleading as well. In this example, there is no guarantee that NELL actually believes this concept to be a coach, and NELL may believe that the concept belongs to other categories not mentioned in the name. It is also possible that NELL is not yet certain the concept belongs to any category at all. Additionally, NELL may be confused about which literal strings refer to which concepts; it may believe that both “Peyton Manning” and “Jim Caldwell” can refer to this one concept, and it might not be clear whether NELL has mistaken Jim Caldwell for a football player or Peyton Manning for a coach. Therefore, it is essential to always look at the set of literal strings that refer to a concept, and at the set of categories to which a concept belongs, in order to determine its true category membership. Simply stripping off the “concept:” prefix and category name will lead to incomplete and erroneous information.

Importantly, this file contains the literal strings that led to the formation of the categories and relations NELL knows, so it could also be a useful seed resource for text understanding systems.
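Reading the (Entity, Relation, Value) triples out of the dump can be sketched as follows. This assumes a tab-separated layout with the triple in the first three columns; the real file carries additional columns (confidence, provenance, and the literal strings just mentioned), so check its header line before parsing.

```python
import io

# Stand-in for the KB dump file, using the two example beliefs from the
# text (lowercased here as a guess at the file's surface form).
sample = io.StringIO(
    "george_harrison\tplaysinstrument\tguitar\n"
    "guitar\tgeneralizations\tmusicalinstrument\n"
)

# Take the first three tab-separated columns of each line as the triple.
triples = [tuple(line.rstrip("\n").split("\t")[:3]) for line in sample]
print(triples)
```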







Lots more about NELL here: http://rtw.ml.cmu.edu/rtw/publications

2 Responses

  1. Each day NELL extracts or ‘reads’ facts from the web, and integrates these into its growing network of beliefs.

    In the early stages of designing AI Minds, we assume that human users will input true facts into the AI and not tell lies to the AI. As we design the ability of an AI to engage in automated reasoning, we let ideas influence each other in the AI and we do not yet insist that each idea be a strongly held belief. For a robot AI to operate “in real life” (IRL), it will need the noetic strength to withstand the harmful pressure of lies and falsehoods.

    A single idea or assertion, such as “Birds have wings,” is not enough to constitute a belief in an AI Mind. A mind with massive parallelism is able to weigh the relative strengths of positive assertions and their negations in memory to arrive at a sum total of belief about a fact. An AI mindgrid operating under the stricture of the von Neumann bottleneck may not be able to calculate the relative strengths of assertions and their negations so as to hold a strong, unassailable belief. Massive parallelism (q.v.) or “maspar” may be the conditio sine qua non if we want a mind to be powerful enough to hold its own beliefs.
