
About the Author and Abstracts of the Doctoral Dissertation (Chinese and English)

Thesis title: Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

About the author: ___, born in 19__; began doctoral study under the supervision of Professor ___ in September 20__; received the doctoral degree in July 20__.

This thesis studies topic modeling on large-scale text data. Specifically, it proposes three fully decomposable large-scale topic modeling approaches: Regularized Latent Semantic Indexing (RLSI), Online Regularized Latent Semantic Indexing (Online RLSI), and Group Regularized Latent Semantic Indexing (Group RLSI).

RLSI takes matrix factorization as the core of the topic model and adds specific regularization terms to meet different modeling requirements and control model complexity. The advantage of RLSI is that, by building on matrix factorization, it naturally inherits the fully decomposable, highly parallelizable structure of factorization methods, so parallel or distributed processing is easy to implement. Experiments show that the topic modeling quality of RLSI is comparable to that of existing approaches, yet after simple distributed processing RLSI handles far larger datasets more efficiently than existing distributed topic modeling methods, achieving topic modeling at a genuinely large scale.
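For concreteness, one typical instance of such a regularized factorization objective is sketched below; the specific choice of an $\ell_1$ penalty on topics and an $\ell_2$ penalty on document representations is an illustrative assumption consistent with the description above, not necessarily the exact configuration used in the thesis:

$$\min_{U,V}\ \lVert D - UV \rVert_F^2 \;+\; \lambda_1 \sum_{k=1}^{K} \lVert \mathbf{u}_k \rVert_1 \;+\; \lambda_2 \sum_{n=1}^{N} \lVert \mathbf{v}_n \rVert_2^2 ,$$

where $D \in \mathbb{R}^{M \times N}$ is the term-document matrix, $U \in \mathbb{R}^{M \times K}$ is the term-topic matrix with columns $\mathbf{u}_k$, and $V \in \mathbb{R}^{K \times N}$ is the topic-document matrix with columns $\mathbf{v}_n$. With $U$ fixed, the objective splits into $N$ independent per-document problems, and with $V$ fixed, into $M$ independent per-term problems; this per-row/per-column separability is what makes parallel and distributed processing straightforward.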


Online RLSI is an online-learning extension of RLSI; its core idea is to process documents in batches in order of time. With online learning, only a small portion of the data needs to be loaded into memory for computation at any moment, which further reduces the memory consumption of RLSI. Online learning also captures how text content changes over time, so the extracted topics carry corresponding dynamic characteristics. Experiments show that, with limited memory, Online RLSI scales to larger datasets than RLSI; it also captures topic changes over time sensitively, effectively realizing dynamic topic modeling.
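As a rough illustration of the memory argument, the sketch below processes a stream of documents one at a time, keeping only the current document vector and the topic matrix in memory. The function name, the ridge-regression step for the document representation, and the plain gradient update on the topic matrix are illustrative assumptions, not the algorithm from the thesis.

```python
import numpy as np

def online_topic_update(doc_stream, n_terms, n_topics, lam2=0.1, lr=0.01, seed=0):
    # Minimal online-learning sketch: documents arrive one at a time, so only
    # the current document vector and the (comparatively small) topic matrix
    # live in memory, regardless of how large the collection is.
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((n_terms, n_topics))  # term-topic matrix
    for d in doc_stream:  # d: length-n_terms term-frequency vector
        # Per-document ridge regression: min_v ||d - U v||^2 + lam2 ||v||^2
        A = U.T @ U + lam2 * np.eye(n_topics)
        v = np.linalg.solve(A, U.T @ d)
        # One stochastic gradient step on U for this document (an illustrative
        # stand-in for whatever incremental update the thesis uses)
        U += lr * np.outer(d - U @ v, v)
    return U
```

Because each step touches a single document, the working set stays at roughly O(n_terms × n_topics) no matter how many documents arrive, which mirrors the scaling claim above.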

Group RLSI is a further extension of RLSI; its core idea is to use the documents' existing class-label information to split them into groups and to process the groups as independently as possible. This grouping decomposes the original large-scale RLSI problem into a series of small-scale problems that can be solved independently, further improving the computational efficiency of RLSI. Grouping also produces finer-grained topics that reflect local characteristics of the text in more detail. Experiments show that, at the same data scale, Group RLSI is far more efficient than RLSI, and the advantage grows as the total number of topics increases; moreover, the topics it extracts characterize local features of the text more precisely, making it a more accurate topic modeling approach.
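The grouping idea can likewise be sketched as follows: partition documents by label, then factorize each group's term-document matrix independently, so each group could run on its own worker. The alternating-least-squares inner loop and all names here are illustrative stand-ins, not the solver from the thesis.

```python
import numpy as np
from collections import defaultdict

def grouped_factorization(docs, labels, n_topics_per_group=10, n_iters=20, lam2=0.1):
    # Partition documents by class label; the groups never interact, so the
    # large factorization problem splits into small independent ones.
    groups = defaultdict(list)
    for d, y in zip(docs, labels):
        groups[y].append(d)
    topics = {}
    for y, group_docs in groups.items():
        D = np.column_stack(group_docs)            # terms x docs for this group
        rng = np.random.default_rng(0)
        U = np.abs(rng.standard_normal((D.shape[0], n_topics_per_group)))
        I = np.eye(n_topics_per_group)
        for _ in range(n_iters):                   # plain ridge-regularized ALS
            V = np.linalg.solve(U.T @ U + lam2 * I, U.T @ D)
            U = np.linalg.solve(V @ V.T + lam2 * I, V @ D.T).T
        topics[y] = U                              # group-local topic matrix
    return topics
```

Since each group is solved separately, the per-solve cost depends on the topics per group rather than on the overall topic count, consistent with the efficiency claim above.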

Keywords: topic modeling, matrix factorization, parallel/distributed processing

Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

WANG Quan

ABSTRACT

Topic modeling aims to automatically discover the latent topics in a document collection as well as represent the documents with the discovered topics. It provides a powerful way to better understand as well as better represent the content of documents. Nowadays, it has become a popular tool in various text mining tasks, such as text classification, text clustering, and information retrieval. In real-world applications, however, the usefulness of topic modeling is limited due to scalability issues. Scaling to larger document collections via parallel or distributed processing is an active area of research, but most solutions cannot reduce memory cost during topic modeling, and thus require drastic steps such as vastly reducing the input vocabulary. How to effectively and efficiently perform topic modeling on large-scale document collections remains the biggest challenge.

In this thesis, I study the problem of large-scale topic modeling. Specifically, I propose three fully decomposable large-scale topic modeling approaches, including Regularized Latent Semantic Indexing, Online Regularized Latent Semantic Indexing, and Group Regularized Latent Semantic Indexing, referred to as RLSI, Online RLSI, and Group RLSI, respectively.

RLSI formalizes topic modeling as a matrix factorization problem, with specific regularization terms to meet different modeling requirements and deal with over-fitting. This formulation allows the learning problem to be decomposed into multiple sub-problems, which can be processed in parallel or distributed mode. Experimental results show that RLSI is as effective as existing topic modeling approaches. On the other hand, by utilizing distributed processing, RLSI can easily scale up to much larger datasets than existing approaches.

Online RLSI is an online extension to RLSI, the key of which is to take documents as stream data and process them one by one. As a result, Online RLSI keeps only one document as well as its related information in memory, and thus significantly reduces memory cost. Moreover, such an online learning mode makes it capable of capturing the evolution of topics. Experimental results show that compared with RLSI, Online RLSI can scale up to even larger datasets with limited storage. In addition, it can effectively capture the evolution of topics and is capable of topic tracking.

Group RLSI is another extension to RLSI, the key of which is to split documents into groups according to pre-defined class labels and process these document groups as independently as possible. As a result, Group RLSI decomposes the large-scale matrix factorization problem concerning all documents into multiple small-scale ones concerning only subsets of documents, and thus significantly enhances computational efficiency. Moreover, such a group learning mode makes it capable of discovering small-granularity topics which can better characterize local information in the document collection. Experimental results show that Group RLSI is much more efficient than RLSI, particularly when the number of topics gets larger. In addition, it can effectively discover small-granularity topics that better characterize local information.

Keywords: Topic Modeling, Matrix Factorization, Parallel/Distributed Processing
