Data Mining & Data Warehousing
Unit IV: 2 mark questions with answers and 16 mark questions
Part A
1) Define classification.
Data classification is a two-step process. In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the second step, the model is used for classification.
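A minimal sketch of the two-step process, using scikit-learn's decision tree classifier (the library choice and the attribute values below are illustrative assumptions, not part of the source):

```python
# Step 1: build a model from training tuples whose class label is known.
from sklearn.tree import DecisionTreeClassifier

X_train = [[30, 30000], [45, 60000], [25, 20000], [50, 90000]]  # [age, income]
y_train = ["no", "yes", "no", "yes"]                            # class label attribute
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: use the model to classify a new, unlabeled tuple.
print(model.predict([[40, 70000]]))  # e.g. ['yes']
```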
2) Define training data set.
The data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.
3) Define accuracy of a model.
The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model.
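A quick illustration of the computation (the predicted and actual labels below are made up):

```python
# Accuracy = percentage of test samples whose predicted class matches the actual class.
predicted = ["yes", "no", "yes", "yes", "no"]
actual    = ["yes", "no", "no",  "yes", "no"]
correct = sum(p == a for p, a in zip(predicted, actual))
print(correct / len(actual) * 100)   # 80.0 (% of the test set correctly classified)
```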
4) Define prediction.
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample or to assess the value or value ranges of an attribute that a given sample is likely to have. Classification and regression are the two major types of prediction.
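A small sketch of numeric prediction via a least-squares line fit; the use of numpy and the sample data are assumptions made here for illustration:

```python
import numpy as np

# Fit a straight line to known (experience, salary) pairs, then predict a new value.
years_experience = np.array([1, 3, 5, 7, 9])
salary = np.array([30000, 42000, 55000, 68000, 80000])
slope, intercept = np.polyfit(years_experience, salary, 1)
print(slope * 6 + intercept)   # predicted (continuous) salary for 6 years of experience
```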
5) Differentiate classification and prediction.
Classification is used to predict discrete or nominal (categorical) class labels, whereas prediction is used to predict continuous values, typically by regression. Both are forms of supervised learning, since the model is built from training samples whose class labels or target values are already known.
6) List the applications of classification and prediction.
Applications include credit approval, medical diagnosis, performance prediction, and selective marketing.
7) List the preprocessing steps involved in preparing the data for classification and prediction.
Data cleaning, relevance analysis, and data transformation.
8) Define normalization.
Normalization involves scaling all values for a given attribute so that they fall within a small specified range such as -1.0 to 1.0 or 0.0 to 1.0.
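A sketch of min-max normalization to the 0.0 to 1.0 range (the income values are illustrative only):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value of the attribute linearly into [new_min, new_max].
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

income = [12000, 35000, 54000, 98000]
print(min_max_normalize(income))   # all values now fall within 0.0 to 1.0
```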
9) What is a decision tree?
A decision tree is a flowchart-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in the tree is called the root node.
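A tiny sketch of this structure as nested dictionaries; the attributes, branch values, and class labels are made-up illustration data:

```python
# Internal nodes test an attribute, branches are outcomes of the test,
# and leaves hold class labels.
tree = {
    "attribute": "age",                       # root node: test on "age"
    "branches": {
        "youth": {"attribute": "student",     # internal node
                  "branches": {"yes": "buys", "no": "does_not_buy"}},
        "middle_aged": "buys",                # leaf node: class label
        "senior": "does_not_buy",
    },
}

def classify(tree, sample):
    while isinstance(tree, dict):             # descend until a leaf is reached
        tree = tree["branches"][sample[tree["attribute"]]]
    return tree

print(classify(tree, {"age": "youth", "student": "yes"}))   # -> buys
```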
10) Define tree pruning.
When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.
11) Define information gain.
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node. This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in the partitions.
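A short sketch of the computation; the class counts and partitions are illustrative values in the spirit of the usual textbook example:

```python
import math

def entropy(class_counts):
    # Expected information needed to classify a sample in a partition.
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, partitions):
    # Gain(A) = I(parent) - weighted sum of the entropy of each partition on A.
    total = sum(parent_counts)
    expected = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - expected

parent = [9, 5]                               # e.g. 9 "yes" / 5 "no" samples
partitions = [[2, 3], [4, 0], [3, 2]]         # samples split by a candidate attribute
print(round(information_gain(parent, partitions), 3))   # 0.246
```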
12) List the two common approaches for tree pruning.
Prepruning approach – a tree is "pruned" by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples or the probability distribution of those samples.
Postpruning approach – removes branches from a "fully grown" tree. A tree node is pruned by removing its branches; the lowest unpruned node becomes a leaf and is labeled with the most frequent class among its former branches. (Both approaches are sketched in code below.)
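A brief sketch relating the two approaches to options of scikit-learn's decision tree (this mapping to a particular library is an assumption for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Prepruning: halt construction early, e.g. by capping depth or leaf size.
prepruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Postpruning: grow the tree fully, then remove branches;
# here via cost-complexity pruning controlled by ccp_alpha.
postpruned = DecisionTreeClassifier(ccp_alpha=0.01)
```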
13) List the problems in decision tree induction and how it can be prevented.
Fragmentation, repetition, and replication. Attribute construction is an approach for preventing these problems, where the limited representation of the given attributes is improved by creating new attributes based on the existing ones.
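A trivial sketch of attribute construction (the attribute names and values are hypothetical):

```python
# Derive a new attribute from existing ones so that a single test on it
# can replace repeated tests on the original attributes.
tuples = [{"length": 4, "width": 3}, {"length": 6, "width": 2}]
for t in tuples:
    t["area"] = t["length"] * t["width"]   # constructed attribute
print(tuples)
```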
14) What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases. The simple Bayesian classifier, known as the naïve Bayesian classifier, is comparable in performance with decision tree and neural network classifiers.
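A minimal naïve Bayesian classifier sketch for categorical attributes, assuming class-conditional independence; the training samples and attribute values are made up for illustration:

```python
from collections import defaultdict

def train_naive_bayes(samples, labels):
    # Estimate P(C) and P(attribute value | C) from frequency counts.
    priors = defaultdict(int)
    cond = defaultdict(lambda: defaultdict(int))   # (class, attribute index) -> value counts
    for x, c in zip(samples, labels):
        priors[c] += 1
        for i, v in enumerate(x):
            cond[(c, i)][v] += 1
    return priors, cond, len(labels)

def classify(x, priors, cond, n):
    # Choose the class maximizing P(C) * product of P(x_i | C).
    best, best_p = None, -1.0
    for c, count in priors.items():
        p = count / n
        for i, v in enumerate(x):
            p *= cond[(c, i)].get(v, 0) / count
        if p > best_p:
            best, best_p = c, p
    return best

samples = [("sunny", "hot"), ("rain", "mild"), ("sunny", "mild"), ("rain", "hot")]
labels = ["no", "yes", "yes", "no"]
model = train_naive_bayes(samples, labels)
print(classify(("sunny", "mild"), *model))   # -> yes
```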
15) Define Bayesian belief networks.
Bayesian belief networks are graphical models which allow the representation of dependencies among subsets of attributes. They can also be used for classification.
16) List the two components in belief network.
Directed acyclic graph, conditional probability table.
17) Define directed acyclic graph.
In a directed acyclic graph, each node represents a random variable and each arc represents a probabilistic dependence. If an arc is drawn from a node Y to a node Z, then Y is a parent or immediate predecessor of Z, and Z is a descendant of Y. Each variable is conditionally independent of its non-descendants in the graph, given its parents. The variables may be discrete or continuous-valued.
18) Define Conditional Probability Table (CPT).
The CPT for a variable Z specifies the conditional distribution
P (Z| parents (Z)), where parents (Z) are the parents of Z.
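A small sketch of a CPT as a Python dictionary; the network fragment (FamilyHistory and Smoker as parents of LungCancer) and the probabilities follow the common textbook illustration and are assumptions here:

```python
# P(LungCancer = yes | FamilyHistory, Smoker), keyed by the parents' values.
cpt_lung_cancer = {
    ("yes", "yes"): 0.8,
    ("yes", "no"):  0.5,
    ("no",  "yes"): 0.7,
    ("no",  "no"):  0.1,
}
p_yes = cpt_lung_cancer[("yes", "no")]
print(p_yes, 1 - p_yes)   # P(yes | parents) and P(no | parents) sum to 1
```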
19) List the methods used for classification based on concepts from association rule mining.
ARCS (Association Rule Clustering System), Associative classification, CAEP (Classification by Aggregating Emerging Patterns).
20) Explain ARCS?
ARCS mines association rules of the form Aquan1 ^ Aquan2 => Acat, where Aquan1 and Aquan2 are tests on quantitative attribute ranges and Acat assigns a class label for a categorical attribute, based on the given training data. The association rules are plotted on a 2-D grid. The algorithm scans the grid, searching for rectangular clusters of rules. Adjacent ranges of the quantitative attributes occurring within a rule cluster may be combined. The clustered association rules generated by ARCS are then applied to classification.
Part B
1) Explain classification by decision tree induction in detail.
2) Explain Bayesian classification in detail.
3) Define cluster analysis. Explain the various kinds of data in detail.
4) Explain partitioning methods in detail.
5) Explain outlier analysis in detail.
6) Explain how the operational environment is tested.
7) Write an algorithm for k-nearest neighbour classification, given k, the number of nearest neighbours, and n, the number of attributes describing each sample.
8) Write the major steps of decision tree classification.
9) Write a note on the following: i. Visualization techniques ii. Genetic algorithms
10) Explain about OLAP tools.