Wednesday, August 22, 2012

Data Mining & Data Warehousing: Unit 4, 2-mark questions with answers and 16-mark questions

Unit IV

Part A

1)      Define classification.

Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the second step, the model is used for classification.

2)      Define training data set.

The data tuples analyzed to build the model collectively form the training data set. Individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.

3)      Define accuracy of a model.

The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model.
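
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `accuracy` and the sample labels are made up for the example and do not come from the notes.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of test samples whose predicted class matches the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set must not be empty")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Hypothetical test set of five samples; four of them are classified correctly.
actual = ["yes", "no", "yes", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "no"]
print(accuracy(actual, predicted))  # 80.0
```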

4)      Define prediction.

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample or to assess the value or value ranges of an attribute that a given sample is likely to have. Classification and regression are the two major types of prediction.

5)      Differentiate classification and prediction.

Classification predicts discrete or nominal class labels, whereas prediction (regression) models continuous-valued functions. Both are forms of supervised learning, because the model is built from training samples whose target values are known.

6)      List the applications of classification and prediction.

Applications include credit approval, medical diagnosis, performance prediction, and selective marketing.

7)      List the preprocessing steps involved in preparing the data for classification and prediction.

Data cleaning, relevance analysis, and data transformation.

8)      Define normalization.

Normalization involves scaling all values for a given attribute so that they fall within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0.
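
One common way to do this is min-max normalization. The sketch below assumes that technique; the function name `min_max_normalize` and the sample values are illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they fall within [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical, so map every one to the lower bound.
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


# Hypothetical attribute values.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(ages, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```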

9)      What is a decision tree?

A decision tree is a flowchart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. The topmost node in the tree is called the root node.
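
As a rough illustration of this structure, a tree can be written as nested dictionaries and a sample classified by walking from the root to a leaf. The tree, its attribute names, and the `classify` helper below are hypothetical.

```python
# Internal nodes are dicts holding the tested attribute and one branch per outcome.
# Leaf nodes are plain strings holding the class label.
tree = {
    "attribute": "age",  # root node
    "branches": {
        "youth": {"attribute": "student", "branches": {"no": "no", "yes": "yes"}},
        "middle_aged": "yes",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}


def classify(node, sample):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node


sample = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(tree, sample))  # yes
```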

10)  Define tree pruning.

When decision trees are built many of the branches may reflect noise or outliers in training data. Tree pruning attempts to identify and remove such branches with the goal of improving classification accuracy on unseen data.

11)  Define information gain.

The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in the partitions.
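
A minimal sketch of the computation, assuming the usual entropy-based definition: the gain of an attribute is the entropy of the class labels minus the weighted entropy of the partitions produced by splitting on that attribute. The function names and the four-sample data set are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(samples, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    partitions = {}
    for sample, label in zip(samples, labels):
        partitions.setdefault(sample[attribute], []).append(label)
    expected = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - expected


# Hypothetical training samples: "outlook" separates the classes, "windy" does not.
samples = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"], key=lambda a: information_gain(samples, labels, a))
print(best)  # outlook
```

Here "outlook" would be chosen as the test attribute because splitting on it leaves pure partitions.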

12)   List the two common approaches for tree pruning.

Prepruning approach: a tree is "pruned" by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples or the probability distribution of those samples.

Postpruning approach: removes branches from a "fully grown" tree. A tree node is pruned by removing its branches; the lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches.

13)   List the problems in decision tree induction and how they can be prevented.

Fragmentation, repetition, and replication. Attribute construction is an approach for preventing these problems, where the limited representation of the given attributes is improved by creating new attributes based on the existing ones.

14)   What are Bayesian classifiers?

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases. The simple Bayesian classifier, known as the naïve Bayesian classifier, is comparable in performance with decision tree and neural network classifiers.
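
The sketch below shows how a naïve Bayesian classifier can produce class membership probabilities for categorical attributes. It assumes class-conditional independence of the attributes and applies no smoothing; the function names and the five training samples are made up for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(samples, labels):
    """Count classes and, per (class, attribute) pair, the attribute values seen."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)
    for sample, label in zip(samples, labels):
        for attribute, value in sample.items():
            value_counts[(label, attribute)][value] += 1
    return class_counts, value_counts


def posterior(model, sample):
    """Return P(class | sample) for every class, using Bayes' theorem."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in sample.items():
            # P(attribute = value | class), estimated from counts (no smoothing).
            score *= value_counts[(label, attribute)][value] / count
        scores[label] = score
    evidence = sum(scores.values())
    if evidence == 0:
        raise ValueError("sample has zero probability under every class")
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical training data.
samples = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
]
labels = ["buys", "buys", "buys", "no", "no"]

model = train_naive_bayes(samples, labels)
probs = posterior(model, {"student": "no", "credit": "fair"})
print(max(probs, key=probs.get))  # no
```

For this query the posteriors work out to roughly 0.4 for "buys" and 0.6 for "no", so the sample is assigned to "no".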

15)   Define Bayesian belief networks.

Bayesian belief networks are graphical models that allow the representation of dependencies among subsets of attributes. They can also be used for classification.

16)   List the two components in belief network.

Directed acyclic graph, conditional probability table.

17)   Define directed acyclic graph.

In a directed acyclic graph, each node represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its non-descendants in the graph, given its parents. The variables may be discrete or continuous-valued.

18)   Define Conditional Probability Table (CPT).

The CPT for a variable Z specifies the conditional distribution P(Z | parents(Z)), where parents(Z) are the parents of Z.
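
To make the two components concrete, the sketch below stores a small network as a parent map (the directed acyclic graph) plus one CPT per variable, and computes a joint probability as the product of P(Z | parents(Z)) over all variables. The network, its variable names, and every probability value are hypothetical.

```python
# Directed acyclic graph: each Boolean variable mapped to its tuple of parents.
parents = {
    "FamilyHistory": (),
    "Smoker": (),
    "LungCancer": ("FamilyHistory", "Smoker"),
}

# CPTs: for each variable, P(variable = True | parent values).
cpt = {
    "FamilyHistory": {(): 0.1},
    "Smoker": {(): 0.3},
    "LungCancer": {
        (True, True): 0.8,
        (True, False): 0.5,
        (False, True): 0.7,
        (False, False): 0.1,
    },
}


def probability(variable, value, assignment):
    """Look up P(variable = value | parents(variable)) in the CPT."""
    key = tuple(assignment[p] for p in parents[variable])
    p_true = cpt[variable][key]
    return p_true if value else 1.0 - p_true


def joint_probability(assignment):
    """Multiply P(Z | parents(Z)) over every variable Z in the network."""
    result = 1.0
    for variable, value in assignment.items():
        result *= probability(variable, value, assignment)
    return result


case = {"FamilyHistory": False, "Smoker": True, "LungCancer": True}
print(round(joint_probability(case), 3))  # 0.189
```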

19)   List the methods used for classification based on concepts from association rule mining.

ARCS (Association Rule Clustering System), Associative classification, CAEP (Classification by Aggregating Emerging Patterns).

20)   Explain ARCS.

ARCS mines association rules of the form A_quan1 ^ A_quan2 => A_cat, where A_quan1 and A_quan2 are tests on quantitative attribute ranges and A_cat assigns a class label for a categorical attribute from the given training data. Association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for rectangular clusters of rules. Adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS are applied to classification.

Part B

1)      Explain classification by decision tree induction in detail.

2)      Explain Bayesian classification in detail.

3)      Define cluster analysis. Explain the various kinds of data in detail.

4)      Explain partitioning methods in detail.

5)      Explain outlier analysis in detail.

6)      Explain how the operational environment is tested.

7)      Write an algorithm for k-nearest neighbour classification, given k and n, the number of attributes describing each sample.

8)      Write the major steps of decision tree classification.

9)      Write a note on the following: i. Visualization technique    ii. Genetic algorithm

10)  Explain OLAP tools.


--
Hackerx Sasi
with regards
prem sasi kumar arivukalanjiam
