Affiliation: Professor, Computer Science & Technology Programs, BNU-HKBU United International College, China
Title: List data extraction in a probability & similarity framework
In this talk, we introduce a method for harvesting data from Web lists. We first identify the shortcomings of existing work. We then introduce a method, PACS, built on a probability and column similarity framework. In PACS, the lists are first segmented with a conditional random field (CRF). The lists are then aligned in a divide-and-conquer manner. At each step, the column with the largest columnity is identified by taking into account the probability separability, field similarity and field location. The short portions of the lists are then re-merged, re-split and re-aligned. Finally, null values are inserted. The experimental results show that the method achieves better results than existing methods.
Weifeng SU is an associate professor in the Computer Science Programme, BNU-HKBU United International College (UIC).
He received his PhD from the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology in 2007, under the supervision of Prof. Fred Lochovsky. He obtained his master's degree from Xiamen University in 2002 and his bachelor's degree from China University of Petroleum in 1995.
His research interests include Deep Web, Data Mining, Machine Learning, Word Sense Disambiguation, and Natural Language Processing.