1. Introduction to the Chinese Lexical Analyzer ICTCLAS [ICTCLAS3.0 2006 : Chapter 1 - Introduction / 1]
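A word is the smallest meaningful unit of language that can be used on its own. Chinese, however, is written character by character, with no explicit marker between words. Chinese lexical analysis is therefore the foundation of, and the key to, Chinese information processing: without a good lexical analyzer behind it, any system that handles Chinese content loses a great deal of accuracy. The main application areas of automatic Chinese word segmentation include:
Chinese lexical analysis is also a very difficult problem. The main difficulties are:
Although Chinese lexical analysis has been studied for a long time, many applications still face a trade-off: fast systems do not segment accurately enough for practical use, while accurate systems usually depend on large knowledge bases and are too slow for large-scale deployment.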
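Building on many years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). Its main functions are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition; it also supports user dictionaries. We have refined the system over five years and upgraded its kernel six times; the current release is ICTCLAS3.0.

Five reasons to choose ICTCLAS3.0:

1. Best overall performance
Whether a segmenter is practical depends mainly on two factors, segmentation accuracy and processing speed, which constrain each other and are hard to balance. Most systems end up either fast but inaccurate or accurate but slow. We developed the PDAT large-scale knowledge base management technique (200510130690.3), a major advance in combining high speed with high accuracy. It can manage dictionary knowledge bases with millions of entries, look up 1 million entries per second on a single machine, and uses less than 1.5 times the size of the knowledge base in memory. Built on this technique, ICTCLAS3.0 segments at 996 KB/s on a single machine with 98.45% accuracy; the API is no larger than 200 KB, and all dictionary data compresses to less than 3 MB. We regard it as the best Chinese lexical analyzer available at present.

2. A unified theoretical framework for language computing
Chinese lexical analysis involves segmentation, unknown word recognition, part-of-speech tagging, and language-specific special cases. Most systems lack a unified approach and combine loosely coupled modules, so the final model cannot capture the wide variety of linguistic phenomena accurately. ICTCLAS instead uses a Hierarchical Hidden Markov Model (HHMM), which brings every stage of Chinese lexical analysis into one complete theoretical framework and gives the best overall results. The underlying research has been published at top international conferences and in journals, and the model's strength has been confirmed both in theory and in practice.

3. Full support for application development in different environments
ICTCLAS is written entirely in C/C++. It supports Linux, FreeBSD, and the Windows family of operating systems, and can be used from mainstream development languages such as C, C++, C#, Delphi, and Java.

4. Customizable to your needs
All functional modules can be detached and recombined. ICTCLAS is available in GB2312 and BIG5 versions, which process Simplified and Traditional Chinese respectively. It supports the widely accepted segmentation and part-of-speech standards: the ICT tag set ICTPOS3.0, the Peking University standard, the University of Pennsylvania standard, the State Language Commission standard, the Academia Sinica (Taiwan) standard, and the City University of Hong Kong standard. Users can define the output tag set and the output format directly, and so tailor the segmenter to their own requirements.

5. Recognized in authoritative public evaluations at home and abroad, and by 30,000 customers
Some companies, for commercial reasons, test behind closed doors and claim 99.50% accuracy without describing the test environment or method. In a closed test, or a small-scale open test, even 100% accuracy would not be surprising. ICTCLAS1.0 took first place in the evaluation organized by the national 973 expert group, and ICTCLAS2.0 took several first places in the first evaluation organized by SIGHAN, the international special interest group on Chinese language processing; see the system evaluation section for details. These results come from large-scale, on-site, open tests run by authoritative bodies, and can be trusted.

To date, ICTCLAS has issued more than 30,000 licenses to companies and academic institutions in China and abroad, among them 3721, NEC, 中华商务网, 硅谷动力, and 云南日报, as well as Xinjiang University, Tsinghua University, South China University of Technology, and the University of Massachusetts. ICTCLAS has also been covered widely in the media, including 《科学时报》 (Science Times), the overseas edition of 《人民日报》 (People's Daily), and 《科技日报》 (Science and Technology Daily). You can search Google to learn more about how ICTCLAS is being used.

We welcome engineers and researchers in related fields to use the system and to send us their comments.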
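On July 6, 2002, ICTCLAS took part in the open evaluation of the second phase of the national 973 English-Chinese machine translation project. The results were as follows:
Table 3. ICTCLAS results in the 973 evaluation. Notes: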
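To compare and evaluate different methods and systems, SIGHAN (the ACL Special Interest Group on Chinese Language Processing; www.sighan.org), in connection with the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), held the First International Chinese Word Segmentation Bakeoff from April 22 to 25, 2003 [10]. Nineteen research institutions from six countries and regions, including mainland China, Taiwan, and the United States, registered, and twelve teams submitted results.
The bakeoff tested on large corpora and ranked systems by an overall score. The corpora and standards came from Peking University (Simplified), the Penn Chinese Treebank (Simplified), City University of Hong Kong (Traditional), and Academia Sinica in Taiwan (Traditional). Each standard had two tracks: a closed track (restricted training) and an open track (unrestricted training).
ICTCLAS entered all four Simplified Chinese tasks and the Traditional Chinese closed track. It scored 0.881 [10] on the Penn Treebank closed track, ranking first; 0.951 [10] on the Peking University closed track, ranking first; and 0.953 [10] on the Peking University open track, ranking second. Notably, in just two days we took the kernel of the Simplified Chinese version of ICTCLAS and extended the hierarchical hidden Markov model to Traditional Chinese segmentation, where it also reached an overall score of 0.938 [10].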
We ran an open test on the plain-text news corpus of the People's Daily (《人民日报》) for January 1998. The accuracy and speed of ICTCLAS3.0 are shown in the table below:

ICTCLAS milestones:
[ICTCLAS3.0 2006 : API functions 2]
Operating systems and programming languages supported
Libraries and environments supported

Initializes the analyzer and prepares the data ICTCLAS needs, according to the configuration file.
    bool ICTCLAS_Init(const char * sInitDirPath=NULL);
Return Value
    Returns true if initialization succeeds; otherwise returns false.
Parameters
    sInitDirPath: the initial directory path, where the file Configure.xml and the Data directory are stored. The default value is NULL, which means the initial directory is the current working directory.
Remarks
    ICTCLAS_Init must be called before any other ICTCLAS operation. It needs to be called only once, before the system starts using ICTCLAS. When the system is shutting down and no further operations will be made, call ICTCLAS_Exit to destroy all working buffers. Every operation fails if initialization did not succeed. ICTCLAS_Init fails mainly for two reasons: 1) required data is incompatible or missing; 2) the configuration file is missing or contains invalid parameters. The log file ictclas.log in the default directory gives more detail.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_Exit, ICTCLAS configure

Exits the program, frees all resources, and destroys all working buffers used by ICTCLAS.
    bool ICTCLAS_Exit();
Return Value
    Returns true on success; otherwise returns false.
Parameters
    none
Remarks
    ICTCLAS_Exit must be called when the system is shutting down and no further operations will be made. To restart ICTCLAS afterwards, call ICTCLAS_Init again.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_Init

    ICTCLAS_API const result_t * ICTCLAS_ParagraphProcessA(const char *sParagraph,int *pResultCount);
Return Value
    A pointer to the result vector. The vector is managed by the system; the caller must not allocate or free it.
    struct result_t{
Parameters
    sParagraph: the source paragraph
Remarks
    ICTCLAS_ParagraphProcessA works properly only if ICTCLAS_Init has succeeded.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_Init, ICTCLAS Configure

Processes a paragraph and returns a pointer to the result buffer.
Return Value
    Returns a pointer to the result buffer.
Parameters
    sParagraph: the source paragraph
Remarks
    ICTCLAS_ParagraphProcess works properly only if ICTCLAS_Init has succeeded.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_ParagraphProcessA, ICTCLAS_Init, ICTCLAS Configure

Processes a text file.
    bool ICTCLAS_FileProcess(const char *sSourceFilename,const char *sResultFilename);
Return Value
    Returns true if processing succeeds; otherwise returns false.
Parameters
    sSourceFilename: the name of the source file to be analyzed;
Remarks
    ICTCLAS_FileProcess works properly only if ICTCLAS_Init has succeeded. The output format is set in the ICTCLAS configuration.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
    if(!ICTCLAS_Init())
Output
See Also
    ICTCLAS_Init, ICTCLAS Configure

Imports a user-defined dictionary from a text file.
    unsigned int ICTCLAS_ImportUserDict(const char *sFilename);
Return Value
    The number of lexical entries imported successfully.
Parameters
    sFilename: the name of the text file containing the user dictionary
Remarks
    ICTCLAS_ImportUserDict works properly only if ICTCLAS_Init has succeeded. For the format of the text dictionary file, see User-defined Lexicon. You need to call this function only when you first use a customized lexicon or when you change it. Once the lexicon has been imported and is not changed again, ICTCLAS loads it automatically if UserDict is set to "on" in the configuration file. If UserDict is set to "off", the user-defined lexicon is not applied.
Example
    #include <string.h>
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_Init, ICTCLAS Configure, User-defined Lexicon

Searches the lexicon and determines whether a word is listed in the core lexicon.
    bool ICTCLAS_IsWord(const char *sWord);
Return Value
    Returns true if the word exists; otherwise returns false.
Parameters
    sWord: the word to look up
Remarks
    ICTCLAS_IsWord works properly only if ICTCLAS_Init has succeeded.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
    人民 exists
    人民王 not exists

Gets the unigram probability of a given word.
    float ICTCLAS_GetUniProb(const char *sWord);
Return Value
    Returns the unigram probability of the word after a simple smoothing technique has been applied.
Parameters
    sWord: the word to query
Remarks
    ICTCLAS_GetUniProb works properly only if ICTCLAS_Init has succeeded.
Example
    #include "ICTCLAS30.h"
    int main(int argc, char* argv[])
Output
See Also
    ICTCLAS_Init

[ICTCLAS3.0 2006 : Configure 3]
The ICTCLAS configuration is stored in Configure.xml. Its parameters are listed below:
    <?xml version="1.0" encoding="GB2312"?>

[ICTCLAS3.0 2006 : Example 4]
Example
    #include <stdio.h>
    #include <string.h>
    #include "ICTCLAS30.h"

    int main(int argc, char* argv[])
    {
        // Sample 1: lexical analysis of a sentence or paragraph, with a single result
        char sSentence[2000];
        const char *sResult;
        if(!ICTCLAS_Init())
        {
            printf("Init fails\n");
            return -1;
        }
        printf("Input sentence now!\n");
        scanf("%1999s",sSentence);           // bound the read to the buffer size
        while(_stricmp(sSentence,"q")!=0)    // _stricmp is MSVC-specific; enter q to quit
        {
            // ICTCLAS_ParagraphProcess returns the result buffer as text;
            // ICTCLAS_ParagraphProcessA returns a result_t vector instead.
            sResult=ICTCLAS_ParagraphProcess(sSentence);
            printf("%s\nInput string now!\n",sResult);
            scanf("%1999s",sSentence);
        }

        // Sample 2: file segmentation and POS tagging
        ICTCLAS_FileProcess("G:/ICTCLAS26/Test/Test.txt","G:/ICTCLAS26/Test/Test_result.txt");

        // Sample 3: check whether a word exists in the core dictionary
        bool bRtn=ICTCLAS_IsWord("人民");
        printf("人民");
        if(!bRtn) printf(" not");
        printf(" exists\n");

        bRtn=ICTCLAS_IsWord("人民王");
        printf("人民王");
        if(!bRtn) printf(" not");
        printf(" exists\n");

        // Sample 4: user-defined dictionary; analyze the same text before and after the import
        const char *sTest="1989年春夏之交的政治风波1989年政治风波24小时降雪量24小时降雨量863计划ABC防护训练APEC会议BB机BP机C2系统C3I系统C3系统C4ISR系统C4I系统CCITT建议";
        sResult=ICTCLAS_ParagraphProcess(sTest);
        printf("Before Adding User-defined lexicon, the result is:\n%s\n",sResult);
        unsigned int nItems=ICTCLAS_ImportUserDict("Data/UsrLexicon.txt"); // import the user-defined dictionary
        printf("%u user-defined lexical entries added!\n",nItems);
        sResult=ICTCLAS_ParagraphProcess(sTest);
        printf("After Adding User-defined lexicon, the result is:\n%s\n",sResult);

        ICTCLAS_Exit();
        return 0;
    }
See ICTCLAS API; Platforms and Compatibility

Example
    using System;
    using System.Runtime.InteropServices;
    namespace ICTCLASDemo
    [DllImport("ICTCLAS30.dll",CharSet=CharSet.Ansi)]
    [DllImport("ICTCLAS30.dll",CharSet=CharSet.Ansi)]
    /// <summary>
See ICTCLAS API; Platforms and Compatibility

How to compile with the ICTCLAS library on Linux/Unix/FreeBSD
The application must be written in C/C++; see the example in the previous section. Given the ICTCLAS library libICTCLAS.a, you can compile the program (for example, a file named Example.cpp) with the following Makefile.
Makefile
    test: Example.cpp ICTCLAS30.h
See ICTCLAS API; Platforms and Compatibility

You must first register the ICTCLAS COM component. An example follows.
    <%
See ICTCLAS API; Platforms and Compatibility

    procedure TForm1.btnInitClick(Sender: TObject);
    if (@initCls=nil) then
    if (@exitCls=nil) then
    if (@processCls=nil) then
    initCls;
See ICTCLAS API; Platforms and Compatibility

    public class testICTCLAS
    public static void main(String[] args)
    testICTCLAS test = new testICTCLAS();
See ICTCLAS API; Platforms and Compatibility

[ICTCLAS3.0 2006 : User-defined lexicon/ 5]
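The first time you load a user dictionary, or whenever the user dictionary needs to change, simply call ICTCLAS_ImportUserDict(const char *sFilename). The next time the same user dictionary is used, the function does not need to be called again: if <UserDict>On</UserDict> is set in the configuration file, the system loads the dictionary automatically. If <UserDict>Off</UserDict> is set, the system disables the user dictionary.
User dictionary format
How to write comments in a user dictionary
User dictionary example
    /*1989年春夏之交的政治风波
    //政治术语

[ICTCLAS3.0 2006 : Authors/ 6]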
Dr. Kevin Zhang (张华平, Zhang Hua-Ping)

[ICTCLAS3.0 2006 : Related source 7]
2. Part-of-Speech Tag Set Document (词性标注集说明书)
4. ICTCLAS for Linux
5. ICTCLAS JNI for Java
6. ICTCLAS JAR for Lucene
7. ICTCLAS word segmentation for Traditional Chinese

[ICTCLAS3.0 2006 : System log 9]
See the unified license activation procedure.
9. Selected key customers

Hua-Ping Zhang would like to thank:
1. Associate Prof. Qun Liu, for his guidance in Chinese natural language processing and his detailed help throughout the work, including the design and coding of ICTCLAS.
2. My Ph.D. supervisor Shuo Bai, for his tutoring and care over the past years; and my M.Sc. supervisor and division director Xueqi Cheng, for his help and support since my first day in the software division.
3. Prof. Shiwen Yu of Peking University, for the training corpus.
4. Hongkui Yu, for his excellent work implementing organization name recognition with the role model, and for further work such as corpus conversion.
5. Gang Zou, for his excellent work on the automatic evaluation of lexical analysis results.
6. Dr. Bin Wang, Dr. Jian Sun, Mr. Weihua Luo, Jifeng Li, and other colleagues in the LLC (large content computing) group, for their support and instructive discussions.
7. All ICTCLAS users (no fewer than 30,000 so far), for their patience and their active, helpful feedback, and especially the commercial users, for their practical advice and new requirements.
8. His girlfriend Feifei and her family, for their encouragement during the hard work.

[ICTCLAS3.0 2006 : Reference 11]
11 References
[1] ZHANG Hua-Ping et al. Chinese Lexical Analysis Using Hierarchical Hidden Markov Model. Second SIGHAN Workshop affiliated with the 41st ACL, Sapporo, Japan, July 2003, pp. 63-70.
[2] ZHANG Hua-Ping et al. HHMM-based Chinese Lexical Analyzer ICTCLAS. Second SIGHAN Workshop affiliated with the 41st ACL, Sapporo, Japan, July 2003, pp. 184-187.
[3] ZHANG Hua-Ping, LIU Qun, et al. Chinese Name Entity Recognition Using Role Model. Special issue "Word Formation and Chinese Language Processing" of the International Journal of Computational Linguistics and Chinese Language Processing, Vol. 8, No. 2, 2003, pp. 29-60.
[4] ZHANG Hua-Ping, LIU Qun. Automatic Recognition of Chinese Person Name Based on Role-Tagging. (accepted) Chinese Journal of Computer, 2003.
[5] Kevin Zhang (Zhang Hua-Ping), Qun Liu, Hao Zhang, Xueqi Cheng. Automatic Recognition of Chinese Unknown Words Based on Role Tagging. First SIGHAN Workshop affiliated with the 19th COLING, September 2002, pp. 71-77.
[6] ZHANG Hua-Ping, LIU Qun. Model of Chinese Words Rough Segmentation Based on N-Shortest-Paths Method. Journal of Chinese Information Processing, September 2002, Vol. 16(5), pp. 1-7.
[7] ZHANG Hua-Ping, LIU Qun. Automatic Recognition of Chinese Person Name Based on Role-Tagging. Proc. of the 7th Conference of Computer Science for Graduate Students in CAS, July 2002, Sichuan.
[8] YU Hong-Kui, ZHANG Hua-Ping, et al. Automatic Recognition of Chinese Organization Based on Role-Tagging. Proc. of the 20th International Conference on Computer Processing of Oriental Languages, August 2003, pp. 79-87, Shenyang.
[9] LIU Qun, ZHANG Hua-Ping. Chinese Lexical Analysis Using Hierarchical Hidden Markov Model. (accepted) Chinese Journal of Computer Research and Development, 2003.
[10] Richard Sproat, Thomas Emerson. The First International Chinese Word Segmentation Bakeoff. Second SIGHAN Workshop affiliated with ACL 2003, July 2003, pp. 133-143.
[11] 张璋. 中文自然语言资源共享开辟新路. 科学时报, 2003.2.21.
[12] 张璋. 中文LDC推动汉语资源共享. 科学时报, 2003.1.29.
[13] 中新网. 中文语言资源共享, 中国科学家推出处理新技术. 2002.
[14] 新华网. 中科院15项产品和技术免费转让引起关注. 2002.9.9.
[15] 新浪网. 中科院科研成果免费送达中小企业.

[ICTCLAS3.0 2006 : Manual log 8]
1. Kevin Zhang 2003-11-27 Manual Version 1.0 created
2. Kevin Zhang 2003-11-27 Manual Version 1.6 modified; added new API functions, user-defined lexicon, manual log and system log; added the ICTPOS3.0 tag set
3. Kevin Zhang 2006-12-1 Manual Version 2.0 modified

Copyright © 2001-2006 中科计算技术转移中心, Institute of Computing Technology, Chinese Academy of Sciences. All Rights Reserved.