NLP Should Go Beyond Commonsense Knowledge

NLP technology is pervasive in the commonsense knowledge domain, and there are many reasons for this. Most of the internet and its data concern commonsense knowledge and world events, so NLP technology is developed over the data domain that is most readily available. But what about the scholarly domain, with its rapidly growing body of knowledge produced worldwide? These are the specialized domains of knowledge in Science, Technology, Engineering, and Mathematics (STEM), all of which open up countless doors for NLP.

The opportunities are endless!

We are missing out on entire other worlds of technology that present themselves when Natural Language Processing (NLP) is considered beyond the commonsense domain where it is all-pervasive. NLP, a subfield of Computer Science and Artificial Intelligence (AI), refers to the ability of machines to understand human (or natural) language. Our point is that the commonsense knowledge domain we typically seek to understand determines which smart information-access NLP tools are available to us, tools that indeed make the knowledge-processing aspects of our lives easier. This contradicts the common belief that NLP is “available to everyone and in all domains of knowledge.”

Think of the recent NLP-powered, Knowledge Graph (KG) [1] based search engine success stories of Facebook and Google in industry, and the academic NLP projects toward large-scale KG-based commonsense reasoning systems such as Babelfy, DBpedia Spotlight, and NELL (Never Ending Language Learning). Some of the NLP technologies powering these KG-based systems, e.g. Named Entity Recognition (NER), are backed by over three decades of NLP research traceable to the Message Understanding Conference series that began in the late eighties [2]. Now consider the fact that in the scholarly domain, obtaining fine-grained entity-centric knowledge facilitated by well-established NLP systems is only a newly burgeoning research area of the current decade (see this research effort or this one).
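To make the contrast concrete, here is a deliberately minimal sketch of the NER idea: a gazetteer lookup tagger applied to a scientific sentence. The gazetteer entries and type labels (ORGANIZATION, METHOD, DATASET, METRIC) are invented for illustration; real NER systems use trained statistical or neural models, not lookup tables.

```python
# Illustrative only: a tiny gazetteer-based entity tagger, not a real NER system.
# All gazetteer entries and type labels below are made-up examples.

COMMONSENSE_GAZETTEER = {
    "Google": "ORGANIZATION",
    "Hannover": "LOCATION",
}

SCHOLARLY_GAZETTEER = {
    "BERT": "METHOD",
    "SQuAD": "DATASET",
    "F1-score": "METRIC",
}

def tag_entities(text: str, gazetteer: dict) -> list:
    """Return (mention, type) pairs for gazetteer entries found in the text."""
    return [(mention, etype) for mention, etype in gazetteer.items()
            if mention in text]

sentence = "We evaluate BERT on SQuAD and report the F1-score."
print(tag_entities(sentence, COMMONSENSE_GAZETTEER))  # finds no commonsense entities
print(tag_entities(sentence, SCHOLARLY_GAZETTEER))
```

The point of the sketch: a commonsense entity inventory (people, places, organizations) simply does not cover the entity types that matter in scholarly text (methods, datasets, metrics), which is why scholarly NER needs its own research attention.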

As significant advances have been made in NLP on commonsense knowledge, we believe that scholarly domain-specific NLP will gain increasing attention in the years to come. This is owing to the growing impetus toward digitalizing scholarly knowledge via crowdsourcing, e.g. SciGraph, OpenAire, ResearchGraph, Microsoft Academic Graph, and TIB’s very own Open Research Knowledge Graph (ORKG) led by Prof. Dr. Sören Auer. While expert-based crowdsourcing is effective for obtaining high-quality data, it is not necessarily a scalable solution in the face of growing volumes of scientific literature, the processing of which needs the support of automated NLP techniques. Since next-generation scholarly digital library (DL) infrastructures such as TIB’s ORKG have already arrived, there is now more room for NLP technologies to evolve and develop to support their scalable construction.

The ORKG digital research and innovation infrastructure argues for obtaining semantically rich, interlinked KG representations of the “content” of scholarly articles, specifically of their research contributions. With intelligent analytics enabled over such contribution-focused KGs, researchers can track research progress without the cognitive overhead that reading dozens of articles imposes. Allard Oelen, a Research Assistant at L3S (Leibniz University Hannover) and TIB, will present the Open Research Knowledge Graph at the upcoming Knowledge Graph Conference (KGC) 2022. We invite you to attend his session, which promises to be a truly stimulating walkthrough of the ORKG core interface and the intertwinement of human and machine intelligence.
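What a contribution-focused KG representation might look like can be sketched with a handful of subject-predicate-object triples. The property names and values below are hypothetical, invented for illustration; they are not the ORKG’s actual vocabulary or data model.

```python
# Hypothetical sketch: one research contribution expressed as triples.
# Property names (addresses, uses_dataset, ...) and values are invented
# for illustration and do not reflect the actual ORKG data model.

contribution = [
    ("Contribution1", "addresses", "Named Entity Recognition"),
    ("Contribution1", "uses_dataset", "CoNLL-2003"),
    ("Contribution1", "reports_metric", "F1-score"),
]

def describe(triples):
    """Render triples as readable statements, e.g. for a comparison view."""
    return [f"{s} {p.replace('_', ' ')} {o}" for s, p, o in triples]

for line in describe(contribution):
    print(line)
```

Once many contributions are structured this way, comparing them becomes a matter of querying shared properties rather than re-reading the underlying papers.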

Via expert crowdsourcing, scholarly information for digital libraries such as the ORKG can be readily structured based on human judgements. However, the need remains for an automated NLP system as a scalable, complementary solution, one that could even make it easier for experts to structure scholarly knowledge via drag-and-drop recommendations. Jennifer D’Souza, a Postdoctoral Researcher at TIB working on NLP solutions for the ORKG, will also offer a talk titled “The State of the Art on Knowledge Graph Construction from Text” at KGC 2022, focused on the existing technologies developed for the commonsense domain. Her work at the ORKG focuses on finding ways to transfer technologies from the commonsense domain to the scholarly domain, e.g. exploring NER on STEM data, Computer Science NER in the ORKG [3], and automated Leaderboard Extraction in AI [4] for the ORKG benchmarks feature.
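The leaderboard extraction task mentioned above can be pictured as pulling a (task, dataset, metric, score) tuple out of a result sentence. The regex-based sketch below is purely illustrative and assumes an invented sentence pattern; the actual system in [4] uses learned models over full papers, not hand-written patterns.

```python
import re

# Illustrative sketch of leaderboard extraction: recovering a
# (task, dataset, metric, score) tuple from a result sentence.
# The pattern and the example sentence are invented; real systems
# use machine-learned extraction models.

PATTERN = re.compile(
    r"achieves (?P<score>[\d.]+) (?P<metric>\S+) "
    r"on (?P<dataset>[\w-]+) for (?P<task>[\w\s]+)\."
)

def extract_leaderboard(sentence: str):
    """Return (task, dataset, metric, score) if the sentence matches, else None."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("task"), m.group("dataset"),
            m.group("metric"), float(m.group("score")))

s = "Our model achieves 92.8 F1 on CoNLL-2003 for named entity recognition."
print(extract_leaderboard(s))
```

Aggregating such tuples across many papers is what makes a benchmarks feature possible: the same (task, dataset, metric) key groups scores from different papers into one leaderboard.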

The need for scalable solutions to extract knowledge from natural text puts NLP in an important position. What has been done for decades for commonsense knowledge becomes increasingly feasible for more complex tasks, specifically those related to scholarly knowledge extraction. Within the ORKG, machine intelligence in the form of NLP-powered features is already actively used to assist users in creating structured research descriptions. The interplay between human intelligence (via crowdsourcing) and machine intelligence (via NLP) provides the best of both worlds: the quality and precision of humans combined with the scalability of machines. Facilitated by future research and technological advancement, NLP tools promise to offer ease-of-information-access benefits in the scholarly domain in the years to come.


  1. Ehrlinger, Lisa, and Wolfram Wöß. “Towards a definition of knowledge graphs.” SEMANTiCS (Posters, Demos, SuCCESS) 48.1-4 (2016): 2.
  2. Grishman, Ralph, and Beth M. Sundheim. “Message understanding conference-6: A brief history.” COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics. 1996.
  3. D’Souza, Jennifer, and Sören Auer. “Computer Science Named Entity Recognition in the Open Research Knowledge Graph.” arXiv preprint arXiv:2203.14579 (2022).
  4. Kabongo, Salomon, Jennifer D’Souza, and Sören Auer. “Automated Mining of Leaderboards for Empirical AI Research.” International Conference on Asian Digital Libraries. Springer, Cham, 2021.

PhD student in the Joint Lab of TIB and L3S Research Center.