The number of pages within the document is: 38
The self-declared author(s) is/are:
Original authors did not specify.
The subject is as follows:
Original authors did not specify.
The access date was:
2019-02-13 13:32:09.074382
The content is as follows:
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see creativecommons.org/licenses/by/3.0/.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2781721, IEEE Transactions on Knowledge and Data Engineering

SELF-TUNED DESCRIPTIVE DOCUMENT CLUSTERING USING A PREDICTIVE NETWORK, VOL. X, NO. X, XXXXXXX 1

Self-Tuned Descriptive Document Clustering using a Predictive Network

Austin J. Brockmeier, Member, IEEE, Tingting Mu, Member, IEEE, Sophia Ananiadou, and John Y. Goulermas, Senior Member, IEEE

Abstract: Descriptive clustering consists of automatically organizing data instances into clusters and generating a descriptive summary for each cluster. The description should inform a user about the contents of each cluster without further examination of the specific instances, enabling a user to rapidly scan for relevant clusters. Selection of descriptions often relies on heuristic criteria. We model descriptive clustering as an auto-encoder network that predicts features from cluster assignments and predicts cluster assignments from a subset of features. The subset of features used for predicting a cluster serves as its description. For text documents, the occurrence or count of words, phrases, or other attributes provides a sparse feature representation with interpretable feature labels. In the proposed network, cluster predictions are made using logistic regression models, and feature predictions rely on logistic or multinomial regression models. Optimizing these models leads to a completely self-tuned descriptive clustering approach that automatically selects the number of clusters and the number of features for each cluster. We applied the methodology to a variety of short text documents and showed that the selected clusterings, as evidenced by the selected feature subsets, are associated with a meaningful topical organization.
Index Terms: Descriptive clustering, feature selection, logistic regression, model selection, sparse models

• This research was partially supported by Medical Research Council grants MR/L01078X/1 and MR/N015665/1.
• A. J. Brockmeier and J. Y. Goulermas are with the School of Electrical Engineering, Electronics & Computer Science, University of Liverpool, Liverpool, L69 3BX, United Kingdom. (e-mail: j.y.goulermas@liverpool.ac.uk)
• T. Mu and S. Ananiadou are with the School of Computer Science, University of Manchester, Manchester, M1 7DN, United Kingdom.
Date of current version November 18, 2017.

1 INTRODUCTION

Exploratory data analysis techniques such as clustering can be used to identify subsets of data instances with common characteristics. Users can then explore the data by examining some instances in each cluster, rather than examining instances from the full dataset. This enables users to efficiently focus on relevant subsets of large datasets, especially for collections of documents [1]. In particular, descriptive clustering consists of automatically grouping sets of similar instances into clusters and automatically generating a human-interpretable description or summary for each cluster. Each cluster's description allows a user to ascertain the cluster's relevance without having to examine its contents. For text documents, a suitable description for each cluster may be a multi-word label, extracted title, or a list of characteristic words [2]. The quality of the clustering is important, such that it aligns with a user's idea of similarity, but it is equally important to provide a user with an informative and concise summary that accurately reflects the contents of the cluster. However, objective criteria for evaluating the descriptions as a whole, which do not resort to human evaluation, have been largely unexplored.

With the aim of defining an objective criterion, we consider a direct correspondence between description and prediction. We assume each instance is represented with sparse features (such as a bag of words), and each cluster will be described by a subset of features. A cluster's description
should summarize its contents, such that the description alone should enable a user to predict whether an arbitrary instance belongs to a particular cluster. Likewise, a machine classifier trained using the feature subset should also be predictive of the cluster membership. The classification accuracy provides an objective and quantitative criterion to compare among different feature subsets.

To serve as a concise description, the number of features used by the classifier must be limited (e.g., a linear classifier that uses all features is not easily interpretable). A relatively small set of predictive features can be identified using various feature selection methods [3], [4]. In particular, we identify feature subsets by various statistical and information-theoretic criteria [5] and by training linear classifiers with additional sparsity-inducing regularizations [6], [7], [8], e.g., the $\ell_1$-norm for the Lasso [9] or a combination of $\ell_1$- and $\ell_2$-norms for the Elastic Net [10], such that only a small set of features have non-zero coefficients. In a similar spirit, Lasso has been used for selecting predictive features for explaining classification models [11].

In addition to the cardinality constraint on the number of features, we only permit features that are positively correlated with a given cluster, i.e., features whose presence is indicative of the cluster. This constraint ensures that no cluster is described by the absence of features that are present in other clusters. For instance, given a corpus of book and movie reviews, the positivity constraint prevents a cluster consisting mainly of book reviews from being described as ¬movie, i.e., the absence of the word feature movie. In general, this constraint can be enforced by admitting only features that are positively correlated with a particular cluster; for linear classifiers, this can be done by enforcing the constraint that the coefficients are non-negative [12].
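To make the link between description and prediction concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical toy corpus of book and movie reviews encoded as binary bag-of-words vectors, and fits a logistic regression for one cluster with an $\ell_1$ penalty and non-negative coefficients using plain proximal gradient descent (a solver chosen here only for brevity). The vocabulary, documents, penalty weight, step size, and all function names are assumptions made for illustration.

```python
import math

# Hypothetical toy corpus: each row is a binary bag-of-words vector over VOCAB.
VOCAB = ["book", "movie", "author", "actor", "plot", "great"]
DOCS = [
    [1, 0, 1, 0, 1, 0],  # book reviews
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],  # movie reviews
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
IN_CLUSTER = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = assigned to the "book review" cluster


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_nonneg_sparse_logreg(X, y, l1=0.05, step=0.5, n_iter=2000):
    """Proximal gradient descent on the mean logistic loss + l1 * sum(w), with w >= 0."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * x[j] / n
        b -= step * grad_b  # the bias is neither penalized nor constrained
        # For w >= 0 the l1 penalty is linear, so the proximal step is a shifted
        # gradient step followed by clipping at zero (the positivity constraint).
        w = [max(0.0, wj - step * (gj + l1)) for wj, gj in zip(w, grad_w)]
    return w, b


def describe(w, vocab=VOCAB):
    """The cluster description: labels of features with non-zero weight, largest first."""
    chosen = [(wj, term) for wj, term in zip(w, vocab) if wj > 0.0]
    return [term for _, term in sorted(chosen, reverse=True)]


def accuracy(X, y, w, b):
    """Fraction of instances whose cluster membership is predicted correctly."""
    hits = sum(
        (sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5) == bool(label)
        for x, label in zip(X, y)
    )
    return hits / len(X)


weights, bias = fit_nonneg_sparse_logreg(DOCS, IN_CLUSTER)
description = describe(weights)
print("description:", description)
print("accuracy:", accuracy(DOCS, IN_CLUSTER, weights, bias))
```

Because the weights are clipped at zero, words that occur only outside the cluster (here movie and actor) can never enter the description, so the cluster cannot be described as ¬movie. The accuracy of the resulting sparse classifier is the kind of objective score that could be used to compare candidate feature subsets.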