OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services

Published in The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022

Recommended citation: Liu, Xiao & Yin, Da & Zheng, Jingnan & Zhang, Xingjian & Zhang, Peng & Yang, Hongxia & Dong, Yuxiao & Tang, Jie. (2022). OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), 3418-3428. DOI: 10.1145/3534678.3539210. http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD22-Liu-et-al-OAG-BERT.pdf

Academic knowledge services have substantially facilitated the development of the science enterprise by providing a wealth of efficient research tools. However, many applications depend heavily on ad-hoc models and expensive human labeling to understand scientific content, hindering deployment in real products. To build a unified backbone language model for different knowledge-intensive academic applications, we pre-train an academic language model, OAG-BERT, that integrates both the heterogeneous entity knowledge and the scientific corpora in the Open Academic Graph (OAG) – the largest public academic graph to date. In OAG-BERT, we develop strategies for pre-training on text and entity data along with zero-shot inference techniques. Its zero-shot capability further reduces the need for expensive annotations. OAG-BERT has been deployed in real-world applications, such as the reviewer recommendation function for the National Natural Science Foundation of China (NSFC) – one of the largest funding agencies in China – and paper tagging in AMiner. All code and pre-trained models are available via the CogDL toolkit.
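Since the pre-trained models are released through CogDL, a short sketch of how they can be loaded may be helpful. This follows the usage published in the CogDL documentation at the time; the exact loader name and checkpoint identifiers are assumptions that may differ across CogDL versions.

```python
# A minimal sketch of loading OAG-BERT via CogDL (pip install cogdl).
# The oagbert() loader and its return signature follow CogDL's published
# usage; verify against your installed CogDL version.
from cogdl.oag import oagbert

# Downloads the pre-trained weights on first use and returns a
# tokenizer plus a BERT-style model.
tokenizer, bert_model = oagbert()

# Encode a piece of scientific text and obtain its hidden states.
sequence = "OAG-BERT integrates entity knowledge with scientific corpora."
tokens = tokenizer(sequence, return_tensors="pt")
outputs = bert_model(**tokens)
```

The resulting representations can then feed downstream academic applications such as paper tagging or reviewer recommendation, as described above.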
