Data Mining

NIT3171
ICT Business Analytics and Data Visualization
Data Mining (I)

Outline

NIT3171 – ICT Business Analytics &DataVisualization 2

Data Mining Concepts

Data Mining

• Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years.
• This is due to the wide availability of huge amounts of data and the imminent need to turn such data into useful information and knowledge.
• Data mining can be viewed as a result of the natural evolution of information technology.

Data Mining

• Data mining refers to extracting or mining knowledge from large amounts of data.
• Data mining is also treated as knowledge discovery in databases (KDD).
• Knowledge: non-trivial, implicit, previously unknown, and potentially useful information or patterns.

Knowledge Discovery Process

  1. Data cleaning
    – To remove noise and inconsistent data
  2. Data integration
    – Where multiple data sources may be combined
  3. Data selection
    – Where data relevant to the analysis task are retrieved from the database

Knowledge Discovery Process

  4. Data transformation
    – Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations
  5. Data mining
    – An essential process where intelligent methods are applied in order to extract data patterns

Knowledge Discovery Process

  6. Pattern evaluation
    – To identify the truly interesting patterns representing knowledge, based on some interestingness measures
  7. Knowledge presentation
    – Where visualization and knowledge representation techniques are used to present the mined knowledge to the user

Knowledge Discovery Process

• Steps 1 to 4 (data cleaning, integration, selection, and transformation) are different forms of data preprocessing, where the data are prepared for mining.
• Step 5 (data mining) interacts with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
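
The seven steps of the knowledge discovery process can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the sales records, the function names, and the "pairs bought together" mining task are all invented for the example and are not part of the course material.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical data sources: (customer, item, quantity).
store_a = [("ann", "milk", 1), ("ann", "bread", 2), ("bob", "milk", None)]
store_b = [("cat", "milk", 1), ("cat", "bread", 1), ("dan", "eggs", -3)]

def clean(records):
    """Step 1: drop records with missing or impossible quantities."""
    return [r for r in records if r[2] is not None and r[2] > 0]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the fields relevant to the task."""
    return [(customer, item) for customer, item, _ in records]

def transform(pairs):
    """Step 4: aggregate into one basket (set of items) per customer."""
    baskets = {}
    for customer, item in pairs:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each pair of items."""
    counts = Counter()
    for items in baskets.values():
        counts.update(combinations(sorted(items), 2))
    return counts

def evaluate(counts, min_count):
    """Step 6: keep patterns that meet an interestingness threshold."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7: render the mined patterns as readable text."""
    return [f"{a} and {b} bought together by {n} customers"
            for (a, b), n in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
baskets = transform(select(data))
patterns = evaluate(mine(baskets), min_count=2)
report = present(patterns)
print(report)
```

In a real system each step would usually be far more involved; the point here is only the order of the steps and what each one hands to the next.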

Data Mining Systems

• Major components in a data mining system (DMS)
– Database, data warehouse, World Wide Web, or other information repository
– Database or data warehouse server
– Knowledge base
– Data mining engine
– Pattern evaluation module
– User interface

Data Mining Systems

• Database, data warehouse, World Wide Web, or other information repository
– This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories.
– Data cleaning and data integration techniques may be performed on the data.

Data Mining Systems

• Database or data warehouse server
– The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

• Knowledge base
– This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.

Data Mining Systems

• Data mining engine
– This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Data Mining Systems

• Pattern evaluation module
– This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
– The pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used.

Data Mining Systems

• User interface
– This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results.
– This component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

Data Mining Functionalities

Data Mining Functionalities

• Patterns to be discovered
– Concept/class description: characterization and discrimination
– Frequent patterns, associations, and correlations
– Classification and prediction
– Cluster analysis
– Outlier analysis
– Evolution analysis

Concept/Class Description: Characterization and Discrimination

• Data characterization
– By summarizing the data of the class under study in general terms
• Data discrimination
– By comparison of the target class with one or a set of comparative classes
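
As a rough illustration of the difference, characterization summarizes one class on its own, while discrimination sets that summary against a contrasting class. The customer records, the segment names, and the choice of "mean spend" as the summary are assumptions made up for this sketch.

```python
from statistics import mean

# Hypothetical customer records: (segment, yearly_spend).
customers = [("big_spender", 5200), ("big_spender", 4800),
             ("budget", 900), ("budget", 1100), ("budget", 1000)]

def characterize(records, target):
    """Summarize the target class in general terms."""
    spends = [spend for segment, spend in records if segment == target]
    return {"count": len(spends), "mean_spend": mean(spends)}

def discriminate(records, target, contrast):
    """Compare the target class summary with a contrasting class."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {"mean_spend_ratio": t["mean_spend"] / c["mean_spend"]}

summary = characterize(customers, "big_spender")
comparison = discriminate(customers, "big_spender", "budget")
print(summary, comparison)
```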

Frequent Patterns, Associations, and Correlations

• Frequent patterns
– Patterns that occur frequently in data.
– Can be itemsets, subsequences, and substructures
• For example, frequent itemsets are sets of items that frequently appear together in a transactional data set
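
A minimal sketch of finding frequent itemsets by brute-force counting. The transactions are invented, and `min_support` (the minimum fraction of transactions that must contain the itemset) is a threshold chosen for the example. Practical algorithms such as Apriori avoid counting every combination; this version only shows what "frequent" means.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data set: one set of items per purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    """
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), size))
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

pairs = frequent_itemsets(transactions, size=2, min_support=0.5)
print(pairs)
```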

Classification and Prediction

• Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
• Classification predicts categorical (discrete, unordered) labels.
• Prediction models continuous-valued functions; that is, it is used to predict missing or unavailable numerical data values rather than class labels.

Cluster Analysis

• Clustering analyzes data objects without consulting a known class label.
• In general, the class labels are not present in the training data simply because they are not known to begin with.
• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

Outlier Analysis

• Outliers are data objects that do not comply with the general behavior or model of the data in a database.
• Normally, these outliers are treated as noise or exceptions, but the rare events can be more interesting than the more regularly occurring ones.
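
One simple way to flag values that do not comply with the general behavior of the data is a z-score test: measure how many standard deviations each value lies from the mean. This is only one of many possible outlier methods; the data values and the threshold of 2.0 are assumptions made for the sketch.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction amounts; 95 is far from the rest.
amounts = [10, 12, 11, 13, 12, 95]
outliers = z_score_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or a rare event worth investigating (for example, possible fraud) is a decision for the analyst, not the test.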

Evolution Analysis

• Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
• This may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data. Distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
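
As a small example of time-series data analysis, a moving average smooths short-term fluctuation so that the underlying trend is easier to see. The monthly sales figures and the window size of 3 are invented for illustration.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical monthly sales figures.
sales = [10, 12, 14, 13, 15, 17]
trend = moving_average(sales, window=3)
print(trend)
```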

Interesting Patterns

• A pattern is interesting if it
– is easily understood by humans
– is valid on new or test data with some degree of certainty
– is potentially useful
– is novel
– validates a hypothesis that the user sought to confirm
• An interesting pattern represents knowledge.

Data Mining Models & Tasks

Data Mining Models

Classification

• Goal: previously unseen records should be assigned a class as accurately as possible.
• Find a model for the class attribute as a function of the values of other attributes.
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
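
The train/test workflow described above can be sketched with a 1-nearest-neighbour classifier. The slides do not name a particular classifier, so 1-NN is chosen here only because it is short; the records, labels, and the 6/2 split are likewise made up for the example.

```python
import math

# Hypothetical labelled records: (height_cm, weight_kg) -> class label.
records = [((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
           ((180, 85), "large"), ((185, 90), "large"), ((190, 95), "large"),
           ((152, 52), "small"), ((188, 92), "large")]

# Divide the data set: the first six records build the model, the rest validate it.
train, test = records[:6], records[6:]

def predict(train, x):
    """1-nearest-neighbour: return the label of the closest training record."""
    _, label = min(train, key=lambda rec: math.dist(rec[0], x))
    return label

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict(train, x) == label for x, label in test)
    return correct / len(test)

acc = accuracy(train, test)
print(acc)
```

The key point is that accuracy is measured only on records the model did not see while it was being built.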

Regression

• Predict future values based on past values
• Linear regression assumes a linear relationship exists: y = c_0 + c_1 x_1 + … + c_n x_n
• Find the coefficient values c_0, …, c_n that best fit the data
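
For the single-predictor case y = c_0 + c_1 x, the best-fit coefficients under the usual least-squares criterion have a closed form. The sketch below assumes least squares is what "best fit" means here, and the data points are invented so that the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

# Hypothetical past values lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
c0, c1 = fit_line(xs, ys)
next_value = c0 + c1 * 5  # predict a future value at x = 5
print(c0, c1, next_value)
```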

Clustering

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
• Similarity measures:
– Euclidean distance if attributes are continuous.
– Other problem-specific measures.
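
A minimal k-means sketch using Euclidean distance as the similarity measure. The slides do not prescribe a clustering algorithm, so k-means is an assumption; the points, the starting centroids, and the fixed iteration count are also chosen only for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D data points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Real implementations usually pick starting centroids at random and stop when assignments no longer change; fixed starting points are used here so the result is reproducible.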

Association Rule Mining

• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
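
A dependency rule such as {diapers} → {beer} is commonly scored by its support (how often both sides occur together) and its confidence (how often the right side occurs when the left side does). These two measures are standard but are not named on the slide, and the records below are invented for the sketch.

```python
# Hypothetical records: one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here diapers appear in three of the four records and beer accompanies them in two, so the rule holds in two thirds of the cases where it applies.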

Correlation

• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r:
– 1 = perfect correlation
– -1 = perfect but opposite correlation
– 0 = no correlation
