NIT3171
ICT Business Analytics and Data Visualization
Data Mining (I)
Outline
NIT3171 – ICT Business Analytics & Data Visualization
Data Mining Concepts
Data Mining
• Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years.
• This is due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
• Data mining can be viewed as a result of the natural evolution of information technology.
Data Mining
• Data mining refers to extracting or mining knowledge from large amounts of data.
• Data mining is also treated as knowledge discovery in databases (KDD).
• Knowledge: non-trivial, implicit, previously unknown, and potentially useful information or patterns
Knowledge Discovery Process
- Data cleaning
– To remove noise and inconsistent data
- Data integration
– Where multiple data sources may be combined
- Data selection
– Where data relevant to the analysis task are retrieved from the database
Knowledge Discovery Process
- Data transformation
– Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
- Data mining
– An essential process where intelligent methods are applied in order to extract data patterns
Knowledge Discovery Process
- Pattern evaluation
– To identify the truly interesting patterns representing knowledge, based on some interestingness measures
- Knowledge presentation
– Where visualization and knowledge representation techniques are used to present the mined knowledge to the user
Knowledge Discovery Process
• Steps 1-4 (data cleaning, integration, selection, and transformation) are different forms of data preprocessing, where the data are prepared for mining.
• Step 5 (data mining) interacts with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
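As a rough illustration, the seven steps of the knowledge discovery process can be chained as small functions. This is a minimal sketch on made-up records, not a real data mining system: the data, every function name, and the use of a per-item total as the "mining" step are assumptions made for illustration.

```python
# Hypothetical toy records from two data sources (illustrative values only).
store_a = [
    {"customer": "c1", "item": "milk", "qty": 3},
    {"customer": "c2", "item": "bread", "qty": None},  # inconsistent: missing quantity
    {"customer": "c1", "item": "bread", "qty": 2},
]
store_b = [
    {"customer": "c3", "item": "milk", "qty": 4},
    {"customer": "c3", "item": "eggs", "qty": 1},
]


def clean(records):
    """Step 1, data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Step 2, data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Step 3, data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Step 4, data transformation: aggregate the quantity per item."""
    totals = {}
    for r in records:
        totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]
    return totals


def mine(totals):
    """Step 5, data mining (stand-in): rank items by total quantity."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def evaluate(patterns, min_total):
    """Step 6, pattern evaluation: keep patterns at or above a threshold."""
    return [(item, total) for item, total in patterns if total >= min_total]


def present(patterns):
    """Step 7, knowledge presentation: format the patterns for the user."""
    return [f"{item}: {total}" for item, total in patterns]


def run_kdd(min_total=2):
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, ["item", "qty"])
    return present(evaluate(mine(transform(data)), min_total))


print(run_kdd())  # ['milk: 7', 'bread: 2']
```

Steps 1-4 only reshape the data; the patterns appear at step 5, and steps 6-7 decide which of them reach the user.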
Data Mining Systems
• Major components in a data mining system (DMS)
– Database, data warehouse, World Wide Web, or other information repository
– Database or data warehouse server
– Knowledge base
– Data mining engine
– Pattern evaluation module
– User interface
Data Mining Systems
• Database, data warehouse, World Wide Web, or other information repository
– This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
– Data cleaning and data integration techniques may be performed on the data.
Data Mining Systems
• Database or data warehouse server
– The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
• Knowledge base
– This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
Data Mining Systems
• Data mining engine
– This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
Data Mining Systems
• Pattern evaluation module
– This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
– The pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.
Data Mining Systems
• User interface
– This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
– This component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
Data Mining Functionalities
• Patterns to be discovered
– Concept/class description: characterization and discrimination
– Frequent patterns, associations, and correlations
– Classification and prediction
– Cluster analysis
– Outlier analysis
– Evolution analysis
Concept/Class Description: Characterization and Discrimination
• Data characterization
– Summarizing the data of the class under study (the target class) in general terms
• Data discrimination
– Comparing the target class with one or a set of comparative (contrasting) classes
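A small sketch of the difference between the two descriptions: characterization summarizes one class on its own, while discrimination sets that summary against a contrasting class. The customer records, the `segment` attribute, and both function names are hypothetical, and a per-attribute mean is only one of many possible summaries.

```python
from statistics import mean

# Hypothetical customer records; "segment" plays the role of the class label.
customers = [
    {"segment": "frequent", "age": 30, "spend": 200},
    {"segment": "frequent", "age": 40, "spend": 300},
    {"segment": "frequent", "age": 50, "spend": 400},
    {"segment": "occasional", "age": 20, "spend": 50},
    {"segment": "occasional", "age": 30, "spend": 150},
]

ATTRIBUTES = ("age", "spend")


def characterize(records, target):
    """Summarize the target class in general terms (mean of each attribute)."""
    members = [r for r in records if r["segment"] == target]
    return {a: mean([r[a] for r in members]) for a in ATTRIBUTES}


def discriminate(records, target, contrast):
    """Compare the target class with a contrasting class, attribute by attribute."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {a: t[a] - c[a] for a in ATTRIBUTES}


print(characterize(customers, "frequent"))
print(discriminate(customers, "frequent", "occasional"))
```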
Frequent Patterns, Associations, and Correlations
• Frequent patterns
– Patterns that occur frequently in data.
– Can be itemsets, subsequences, and substructures
• For example, frequent itemsets are sets of items that frequently appear together in a transactional data set
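Frequent itemsets can be found on a toy data set by counting directly. The sketch below is a brute-force count over itemsets of size 1 and 2, not an efficient algorithm such as Apriori. The transactions, the function name, and the `min_support` threshold (the fraction of transactions that must contain the itemset) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support reaches min_support."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(t), size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


for itemset, sup in sorted(frequent_itemsets(transactions, 0.6).items()):
    print(itemset, sup)
```

With a threshold of 0.6, an itemset must appear in at least three of the five transactions, so `("beer", "diapers")` qualifies and `("eggs",)` does not.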
Classification and Prediction
• Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous-valued functions: it is used to predict missing or unavailable numerical data values rather than class labels.
Cluster Analysis
• Clustering analyzes data objects without consulting a known class label.
• In general, the class labels are not present in the training data simply because they are not known to begin with.
• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
Outlier Analysis
• Outliers are data objects that do not comply with the general behavior or model of the data in a database.
• Normally, these outliers are treated as noise or exceptions, but the rare events can be more interesting than the more regularly occurring ones.
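One simple way to flag values that do not follow the general behavior of the data is a z-score test. The slides do not name a method, so this is only one assumed technique among many; the sales figures, the function name, and the threshold of 2 standard deviations are illustrative.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical daily sales counts; 50 is far from the usual range.
daily_sales = [10, 11, 9, 10, 12, 10, 50]
print(zscore_outliers(daily_sales))  # [50]
```

Whether the flagged value is noise to discard or a rare event worth investigating (a fraud case, for example) is a decision for the analyst, not for the test.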
Evolution Analysis
• Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
• This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data.
• Distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
Interesting Patterns
• A pattern is interesting if it is
– Easily understood by humans
– Valid on new or test data with some degree of certainty
– Potentially useful
– Novel
• A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
• An interesting pattern represents knowledge.
Data Mining Models & Tasks
Data Mining Models
Classification
• Goal: previously unseen records should be assigned a class as accurately as possible.
• Find a model for the class attribute as a function of the values of the other attributes.
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
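The train/test procedure can be sketched with a very simple classifier. The slides do not prescribe a model, so a 1-nearest-neighbour rule is used here purely as a stand-in; the records, the fixed (non-random) split, and all function names are assumptions for illustration.

```python
import math

# Hypothetical records: (attribute_1, attribute_2, class_label).
records = [
    (1.0, 1.1, "low"), (1.2, 0.9, "low"), (0.8, 1.0, "low"),
    (3.0, 3.2, "high"), (3.1, 2.9, "high"), (2.9, 3.0, "high"),
    (1.1, 1.0, "low"), (3.0, 3.1, "high"),
]


def split(records, test_fraction=0.25):
    """Divide the data set into a training set and a test set (last records held out)."""
    n_test = max(1, int(len(records) * test_fraction))
    return records[:-n_test], records[-n_test:]


def predict(training_set, attributes):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    nearest = min(training_set, key=lambda r: math.dist(r[:-1], attributes))
    return nearest[-1]


def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict(training_set, r[:-1]) == r[-1] for r in test_set)
    return correct / len(test_set)


training_set, test_set = split(records)
print(accuracy(training_set, test_set))  # 1.0
```

In practice the split is usually randomized; a fixed split is used here only so the result is reproducible.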
Regression
• Predict future values based on past values
• Linear regression assumes a linear relationship exists: $y = c_0 + c_1 x_1 + \dots + c_n x_n$
• Find the values of the coefficients $c_0, \dots, c_n$ that best fit the data
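For the single-predictor case (n = 1), the best-fit coefficients have a closed form under the least-squares criterion. The slides do not say which fitting criterion is meant, so least squares is an assumption here, as are the data and the function name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    c1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    c0 = mean_y - c1 * mean_x
    return c0, c1


# Hypothetical past values: sales observed in months 1 to 4.
months = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

c0, c1 = fit_line(months, sales)
print(c0, c1)       # 1.0 2.0
print(c0 + c1 * 5)  # predicted value for month 5: 11.0
```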
Clustering
• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity measures:
– Euclidean distance if attributes are continuous.
– Other problem-specific measures.
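A minimal sketch of clustering with Euclidean distance, using the k-means procedure. The slides do not name a clustering algorithm, so k-means is an assumed example; the points and the naive choice of the first k points as initial centroids are also illustrative.

```python
import math


def kmeans(points, k, max_iter=100):
    """Group points into k clusters by Euclidean distance to the cluster centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialization: first k points
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical two-attribute data points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No class labels are supplied: the grouping comes only from the distances between points.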
Association Rule Mining
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
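A dependency rule such as {diapers} → {beer} is commonly scored by its support (how often both sides occur together) and its confidence (how often the right side occurs when the left side does). These two measures are standard but are not named on the slide; the transactions and function names below are hypothetical.

```python
# Hypothetical records, each containing some items from a given collection.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction also containing the consequent."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:  # rule never applies
        return 0.0
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / n_antecedent


print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75
```

Here diapers appear in four transactions and beer accompanies them in three, so the rule holds 75% of the time it applies.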
Correlation
• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r:
r = 1: perfect correlation
r = -1: perfect but opposite correlation
r = 0: no linear correlation
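The three reference values of r can be checked with a short computation. The sketch assumes r is the Pearson correlation coefficient, which the slide does not state explicitly; the data and the function name are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Undefined (division by zero) if either variable is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4]
print(pearson_r(x, [2, 4, 6, 8]))    # y rises with x: r is 1
print(pearson_r(x, [8, 6, 4, 2]))    # y falls as x rises: r is -1
print(pearson_r(x, [1, -1, -1, 1]))  # no linear relationship: r is 0
```

A value of 0 rules out only a linear relationship: in the last case y is fully determined by x (it is symmetric around x = 2.5), yet r is 0.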