Friday, August 31, 2012



Hacker Complete Collection

seeders: 29
leechers: 7
To download this torrent, you need a P2P BitTorrent client such as Vuze.

Hacker Complete Collection (Size: 165.23 MB)
 Black Book
 Hack & Crack
 Xbox 360
 Bluetooth Hacking.pdf (353.65 KB)
 Ethical Hacking.pdf (1.23 MB)
 Firewalls And Networks How To Hack Into Remote Computers.pdf (2.86 MB)
 Google Hacks.pdf (2.85 MB)
 Hack Attacks Revealed.pdf (8.06 MB)
 Hack Attacks Testing - How To Conduct Your Own Security Audit.pdf (9.56 MB)
 Hack IT Security Through Penetration Testing.chm (4.58 MB)
 Hack Proofing Your Network - Internet Tradecraft.pdf (2.95 MB)
 Hack Proofing Your Network Second Edition.pdf (8.77 MB)
 Hack Proofing Your Web Server.pdf (84.45 KB)
 Hack The Net.pdf (570.27 KB)
 Hacking a Coke Machine.pdf (116.99 KB)
 Hacking and Network Defense.pdf (152.59 KB)
 Hacking for Dummies.pdf (9.28 MB)
 Hacking Intranet Websites.pdf (5.41 MB)
 Hacking Techniques.pdf (5.03 MB)
 Hacking The Cable Modem.pdf (21.65 MB)
 Hacking The Linux.pdf (86.34 KB)
 Hacking Web Applications.pdf (7.6 MB)
 Hacking Windows XP.pdf (10.14 MB)
 HackingPSP.pdf (13.58 MB)
 Kevin Mitnick - The Art of Intrusion.pdf (3.07 MB)
 Network Security Hacks - Tips & Tools For Protecting Your Privacy.chm (3.85 MB)
 PayPal Hacks.pdf (123.91 KB)
 PC Hacks.chm (5.96 MB)
 Simple Hacks - Addons, Macros And More.pdf (855.06 KB)
 The Database Hacker Handbook Defending Database Servers - CrOwN..chm (1.13 MB)
 Wireless Hacking.pdf (3.91 MB)
 Wireless Network Hacks & Mods for Dummies.pdf (4.73 MB)

Description

This is a complete collection of e-books on hacking and cracking just about everything you could want :)

P.S.: Some of the material is advanced, so you will need some basic knowledge of:

1- PC fundamentals
2- PC-to-PC communication
3- Bluetooth
and much more :)

Don't worry, it works both as a complete guide for beginners and as a reference for experts.
The original collection also included software tools, but I found that they contained a lot of trojans, malware, and viruses that could seriously harm your PC. After some thought, I decided not to include those tools, because I always want my .torrent files to be 100% clean. After all, you can download each tool yourself by following the e-book that describes it.

One last thing: if you want to practice, I suggest practicing on a PC that contains no important data.
Thank you :)

Sunday, August 26, 2012

Oracle Data Mining ::: Assignment


Oracle Data Mining


1 Introduction to Oracle Data Mining

This chapter includes the following topics:
·         What is Data Mining?
·         What Is Oracle Data Mining?
·         New Features

1.1 What is Data Mining?

Too much data and not enough information — this is a problem facing many businesses and industries. Most businesses have an enormous amount of data, with a great deal of information hiding within it, but "hiding" is usually exactly what it is doing: So much data exists that it overwhelms traditional methods of data analysis.
Data mining provides a way to get at the information buried in the data. Data mining creates models to find hidden patterns in large, complex collections of data, patterns that sometimes elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of patterns, or the difficulty in performing the analysis.

1.2 What Is Data Mining in the Database?

Data mining projects usually require a significant amount of data collection and data processing before and after model building. Data tables are created by combining many different types and sources of information. Real-world data is often dirty, that is, includes wrong or missing values; data must often be cleaned before it can be used. Data is filtered, normalized, sampled, transformed in various ways, and eventually used as input to data mining algorithms. Up to 80% of the effort in a data mining project is often devoted to data preparation. When the data is stored as a table in a database, data preparation can be performed using database facilities.
Data mining models have to be built, tested, validated, managed, and deployed in their appropriate application domain environments. The data mining results may need to be post-processed as part of domain specific computations (for example, calculating estimated risks, expected utilities, and response probabilities) and then stored into permanent databases or data warehouses.
Making the entire data mining process work in a reproducible and reliable way is challenging; it may involve automation and transfers across servers, data repositories, applications, and tools. For example, some data mining tools require that data be exported from the corporate database and converted to the data mining tool's format; data mining results must be imported into the database. Removing or reducing these obstacles can enable data mining to be utilized more frequently to extract more valuable information and, in many cases, to make a significant impact on the bottom line of an enterprise. Data mining in the database eliminates the data movement required by tools that do not operate in the database and makes it much easier to mine up-to-date data. Also, the less data movement, the less time the entire data mining process takes.
Data movement can make data insecure. If data never leaves the database, database security protects the data.
In summary, data mining in the database provides the following benefits:
·         Less data movement
·         More data security
·         Up-to-date data

1.3 What Is Oracle Data Mining?

Oracle Data Mining (ODM) embeds data mining within the Oracle database. ODM algorithms operate natively on relational tables or views, thus eliminating the need to extract and transfer data into standalone tools or specialized analytic servers. ODM's integrated architecture results in a simpler, more reliable, and more efficient data management and analysis environment. Data mining tasks can run asynchronously and independently of any specific user interface as part of standard database processing pipelines and applications. Data analysts can mine the data in the database, build models and methodologies, and then turn those results and methodologies into full-fledged application components ready to be deployed in production environments. The benefits of the integration with the database cannot be emphasized enough when it comes to deploying models and scoring data in a production environment. ODM allows a user to take advantage of all aspects of Oracle's technology stack as part of an application. Also, fewer "moving parts" results in a simpler, more reliable, more powerful advanced business intelligence application.
ODM provides single-user multi-session access to models. ODM programs can run either asynchronously or synchronously in the Java interface. ODM programs using the PL/SQL interface run synchronously; to run PL/SQL asynchronously requires using the Oracle Scheduler. For a brief description of the ODM interfaces, see "Java and PL/SQL Interfaces".

1.3.1 Data Mining Functions

Data mining functions can be divided into two categories: supervised (directed) and unsupervised (undirected).
Supervised functions are used to predict a value; they require the specification of a target (known outcome). Targets are either binary attributes indicating yes/no decisions (buy/don't buy, churn or don't churn, etc.) or multi-class targets indicating a preferred alternative (color of sweater, likely salary range, etc.). Naive Bayes for classification is a supervised mining algorithm.
Unsupervised functions are used to find the intrinsic structure, relations, or affinities in data. Unsupervised mining does not use a target. Clustering algorithms can be used to find naturally occurring groups in data.
Data mining can also be classified as predictive or descriptive. Predictive data mining constructs one or more models; these models are used to predict outcomes for new data sets. Predictive data mining functions are classification and regression. Naive Bayes is one algorithm used for predictive data mining. Descriptive data mining describes a data set in a concise way and presents interesting characteristics of the data. Descriptive data mining functions are clustering, association models, and feature extraction. k-Means clustering is an algorithm used for descriptive data mining.
Different algorithms serve different purposes; each algorithm has advantages and disadvantages. A given algorithm can be used to solve different kinds of problems. For example, k-Means clustering is unsupervised data mining; however, if you use k-Means clustering to assign new records to a cluster, it performs predictive data mining. Similarly, decision tree classification is supervised data mining; however, the decision tree rules can be used for descriptive purposes.
Oracle Data Mining supports the following data mining functions:
·         Supervised data mining:
o        Classification: Grouping items into discrete classes and predicting which class an item belongs to
o        Regression: Approximating and forecasting continuous values
o        Attribute Importance: Identifying the attributes that are most important in predicting results
o        Anomaly Detection: Identifying items that do not satisfy the characteristics of "normal" data (outliers)
·         Unsupervised data mining:
o        Clustering: Finding natural groupings in the data
o        Association models: Analyzing "market baskets"
o        Feature extraction: Creating new attributes (features) as a combination of the original attributes
Oracle Data Mining permits mining of one or more columns of text data.
Oracle Data Mining also supports specialized sequence search and alignment algorithms (BLAST) used to detect similarities between nucleotide and amino acid sequences.

1.4 New Features

The following features are new in Oracle Data Mining 10g, Release 2 (compared with ODM 10g, Release 1):
·         New algorithms and improvements to existing algorithms:
o        Decision Tree algorithm, a fast, scalable means of extracting predictive and descriptive information from a database table with respect to a user-supplied target; Decision Tree provides human-understandable rules
o        One-Class Support Vector Machine algorithm for anomaly detection
o        Active learning, an enhancement to the Support Vector Machine algorithm that provides a way to deal with large build data sets
·         Completely revised Java interface consistent with the Java Data Mining (JDM) standard for data mining (JSR-73)
For information about JDM, see the Java Help for the JSR-73 Specification, available on the Oracle Technology Network at http://www.oracle.com/technology/products/bi/odm/JSR-73/index.html
·         SQL Data Mining Functions for applying a model to new data: CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET, FEATURE_ID, FEATURE_SET, FEATURE_VALUE, PREDICTION, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, and PREDICTION_SET
·         O-cluster algorithm added to PL/SQL interface
·         Oracle Data Miner, a graphical user interface for ODM; Oracle Data Miner is distributed on Oracle Technology Network
·         The PL/SQL package DBMS_PREDICTIVE_ANALYTICS, which automates the later stages of data mining; it includes the following procedures:
o        EXPLAIN to rank attributes in order of influence in explaining a target column
o        PREDICT to predict the value of a target attribute (categorical or numerical)
·         Oracle Spreadsheet Add-In for Predictive Analytics enables Microsoft Excel users to mine their Oracle Database or Excel data using the automated methodologies provided by DBMS_PREDICTIVE_ANALYTICS; the Add-In is distributed on Oracle Technology Network

2 Data for Oracle Data Mining

This chapter describes data requirements and how the data should be prepared before it is mined using either of the Oracle Data Mining (ODM) interfaces. The data preparation required depends on the type of model that you plan to build and the characteristics of the data. For example, data with attributes that take on a small number of values, that is, that have low cardinality, may not require binning.
In general, users must prepare data before invoking ODM algorithms.
The following topics are addressed:
·         Data, Cases, and Attributes
·         Data Requirements
·         Data Preparation

2.1 Data, Cases, and Attributes

Data used by ODM consists of tables or views stored in an Oracle database. Both ordinary tables and nested tables can be used as input data. The data used in a data mining operation is often called a data set.
Data has a physical organization and a logical interpretation. Column names refer to physical organization; attribute names, described in the next paragraph, refer to the logical interpretation of the data.
The rows of a data table are often called cases, records, or examples. The columns of the data tables are called attributes or fields; each attribute in a record holds a cell of information. Attribute names are constant from record to record for unnested tables; the values in the attributes can vary from record to record. For example, each record may have an attribute labeled "annual income." The value in the annual income attribute can vary from one record to another.
ODM distinguishes two types of attributes: categorical and numerical. Categorical attributes are those that define their values as belonging to a small number of discrete categories or classes; there is no implicit order associated with the values. If there are only two possible values, for example, yes and no, or male and female, the attribute is said to be binary. If there are more than two possible values, for example, small, medium, large, extra large, the attribute is said to be multiclass.
Numerical attributes are numbers that take on a large number of values that have an order, for example, annual income. For numerical attributes, the differences between values are also ordered. Annual income could theoretically be any value from zero to infinity, though in practice annual income occupies a bounded range and takes on a finite number of values.
You can often transform numerical attributes to categorical attributes. For example, annual income could be divided into three categories: low, medium, high. Conversely, you can explode categorical values to transform them into numerical values.
Classification and Regression algorithms require a target attribute. A supervised model can predict a single target attribute. The target attribute for all classification algorithms can be numerical or categorical. The ODM regression algorithm supports only numerical target attributes.
Certain ODM algorithms support unstructured text attributes. Although unstructured data includes images, audio, video, geospatial mapping data, and documents or text, ODM supports mining text data only. An input table can contain one or more text columns.

2.2 Data Requirements

ODM supports several types of input data, depending on data table format, column data type, and attribute type.

2.2.1 ODM Data Table Format

ODM data must reside in a single table or view in an Oracle database. The table or view must be a standard relational table, where each case is represented by one row in the table, with each attribute represented by a column in the table. The columns must be of one of the types supported by ODM.

2.2.2 Column Data Types Supported by ODM

ODM does not support all the data types that Oracle supports. Each attribute (column) in a data set used by ODM must have one of the following data types:
·         INTEGER
·         NUMBER
·         FLOAT
·         VARCHAR2
·         CHAR
·         DM_NESTED_NUMERICALS (nested column)
·         DM_NESTED_CATEGORICALS (nested column)
The supported attribute data types have a default attribute type (categorical or numerical). For details, see Oracle Data Mining Application Developer's Guide.

2.2.2.1 Nested Columns in ODM

Nested table columns can be used for capturing in a single table or view data that is distributed over many tables (for example, a star schema). Nested columns allow you to capture one-to-many relationships (for example, one customer can buy many products). Nested tables are required if the data has more than 1000 attributes; nested tables are useful if the data is sparse, or if the data is already persisted in a transactional format and must be passed to the data mining interface through an object view.

Note:
The Decision Tree algorithm, described in "Decision Tree Algorithm", does not support nested columns.

The fixed collection types DM_NESTED_NUMERICALS and DM_NESTED_CATEGORICALS are used to define columns that represent collections of numerical attributes and categorical attributes, respectively.
For a given case identifier, attribute names must be unique across all the collections and individual columns. The fixed collection types enforce this requirement.

2.2.3 Missing Values

Data tables often contain missing values.

2.2.3.1 Missing Values and NULL Values in ODM

Certain algorithms assume that a NULL value indicates a missing value; others assume that a NULL value indicates sparse data, as described in "Sparse Data".

2.2.3.2 Missing Value Handling

ODM is robust in handling missing values and does not require users to treat missing values in any special way. ODM will ignore missing values but will use non-missing data in a case.
If an algorithm assumes that NULL values indicate sparse data, then you should treat (that is, impute) any values that are truly missing before mining the data.

2.2.4 Sparse Data

Data is said to be sparse if only a small fraction (no more than 20%, often 3% or less) of the attributes are non-zero or non-null for any given case. Sparse data occurs, for example, in market basket problems. In a grocery store, there might be 10,000 products, and the average size of a basket (the collection of distinct items that a customer purchases in a typical transaction) is 50 products. In this example, a transaction (case or record) has on average 50 out of 10,000 attributes that are not null. This implies that the fraction of non-zero attributes in the table (or the density) is approximately 50/10,000, or 0.5%. This density is typical for market basket and text mining problems.
Association models are designed to process sparse data; indeed, if the data is not sparse, the algorithm may require a large amount of temporary space or may not be able to build a model.
Sparse data is represented in a table in such a way as to avoid storing the most common value explicitly, which saves storage. In such a representation of sparse data, a missing value is implicitly interpreted as the most common value.
Different algorithms make different assumptions about what indicates sparse data. For Support Vector Machine, k-Means, association, and Non-Negative Matrix Factorization, NULL values indicate sparse data; for all other algorithms, NULL values indicate missing values. See the description of each algorithm for information about how it interprets NULL values.
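As an illustration of the density calculation above, the following Python sketch (not ODM code; the basket contents and item names are hypothetical) represents market-basket cases as sets of purchased items and computes the density of the resulting case-by-attribute table.

# Sketch: computing the density of sparse market-basket data.
# Each case (transaction) lists only the items actually purchased.
n_products = 10000                      # total number of possible attributes (items)
baskets = [
    {"milk", "bread", "eggs"},          # hypothetical transactions
    {"milk", "coffee"},
    {"bread", "butter", "jam", "eggs"},
]

non_null_cells = sum(len(b) for b in baskets)       # non-null attribute values
total_cells = len(baskets) * n_products             # cells in the full case-by-attribute table
density = non_null_cells / total_cells
print(f"density = {density:.6f}")       # a tiny fraction, i.e. the data is sparse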

2.2.5 Outliers and Oracle Data Mining

An outlier is a value that is far outside the normal range in a data set, typically a value that is several standard deviations from the mean. The presence of outliers can have a significant impact on certain kinds of ODM models. Naive Bayes, Adaptive Bayes Network, Support Vector Machine, Attribute Importance, either clustering algorithm, and Non-Negative Matrix Factorization are sensitive to outliers.
For example, the presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the ability of the model to detect differences in numerical attributes may be significantly lessened. For example, a numerical attribute such as income may have all the data belonging to a single bin except for one entry (the outlier) that belongs to a different bin.
For outlier treatments, see "Winsorizing and Trimming".

2.3 Data Preparation

Data is said to be prepared when certain data transformations required by a data mining algorithm are performed by the user before the algorithm is invoked. For most algorithms, data must be prepared before the algorithm is invoked.
Different algorithms have different requirements for data preparation; recommended data preparation is discussed with each algorithm in Chapter 3 and Chapter 4.
Data preparation can take many forms, such as joining two or more tables so that all required data is in a single table or view, transforming numerical attributes by applying numerical functions to them, recoding attributes, treating missing values, treating outliers, omitting selected columns for a training data set, and so forth.
ODM includes transformations that perform the following data-mining-specific transformations:
·         Winsorizing and trimming to treat outliers
·         Discretization to reduce the number of distinct values for an attribute
·         Normalization to convert attribute values to a common range

2.3.1 Winsorizing and Trimming

Certain algorithms are sensitive to outliers. Winsorizing and trimming transformations are used to deal with outliers.
Winsorizing involves setting the tail values of an attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 6th percentile, while the upper 5% are set equal to the maximum value in the 95th percentile.
Trimming removes the tails in the sense that trimmed values are ignored in further computations. This is achieved by setting the tails to NULL. This process is sometimes called clipping.
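The following Python sketch is a conceptual illustration of these two treatments, not the ODM transformation itself; the 5th/95th percentile cutoffs follow the 90% Winsorization description above, and the data values are made up.

# Sketch: 90% Winsorization and trimming of a numerical attribute.
def percentile(sorted_vals, p):
    # simple nearest-rank percentile on an already sorted list
    k = max(0, min(len(sorted_vals) - 1, int(round(p / 100.0 * (len(sorted_vals) - 1)))))
    return sorted_vals[k]

values = [12, 15, 14, 13, 900, 16, 11, 14, 13, -250, 15, 14]   # contains two outliers
s = sorted(values)
low, high = percentile(s, 5), percentile(s, 95)

# Winsorizing: tail values are pulled in to the cutoff values.
winsorized = [min(max(v, low), high) for v in values]

# Trimming (clipping): tail values are set to None (NULL) and ignored later.
trimmed = [v if low <= v <= high else None for v in values]

print(winsorized)
print(trimmed)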

2.3.2 Binning (Discretization)

Some ODM algorithms may benefit from binning (discretizing) both numeric and categorical data. Naive Bayes, Adaptive Bayes Network, Clustering, Attribute Importance, and Association Rules algorithms may benefit from binning.
Binning means grouping related values together, thus reducing the number of distinct values for an attribute. Having fewer distinct values typically leads to a more compact model and one that builds faster. Binning must be performed carefully. Proper binning can improve model accuracy; improper binning can lead to loss in accuracy.

2.3.2.1 Methods for Computing Bin Boundaries

ODM utilities provide three methods for computing bin boundaries from the data:
·         Top N most frequent items: This technique is used to bin categorical values. The bin definition for each attribute is computed based on the occurrence frequency of values that are computed from the data. The user specifies a particular number of bins, say N. Each of the bins bin_1,..., bin_N corresponds to the values with top frequencies. The bin bin_N+1 corresponds to all remaining values.
·         Equi-Width Binning: This technique is used to bin numerical values. For numerical attributes, ODM finds the minimum (min) and maximum (max) values for every attribute in the data. Then ODM divides the [min, max] range into N equal bins of size d=(max-min)/N. Thus bin 1 is [min, min+d), bin 2 is [min+d, min+2d), and bin N is [min+(N-1)*d, max]. The number of bins can either be specified by the user or calculated by the transformation.
·         Quantile Binning: This technique is used to bin numerical values. The bin definition for each relevant attribute is computed based on the minimum values for each quantile, where quantiles are computed from the data using the NTILE function. Bins bin_1,..., bin_N span the following ranges: bin_1 spans [min_1, min_2]; bin_2,..., bin_i,..., bin_N-1 span (min_i, min_(i+1)]; and bin_N spans (min_N, max_N]. Bins with equal left and right boundaries are collapsed. (A sketch of both equi-width and quantile binning follows this list.)
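The following Python sketch is illustrative only; it does not use the ODM utilities or the database NTILE function, and the income values are made up. It computes equi-width and quantile bin boundaries following the definitions above.

# Sketch: computing equi-width and quantile bin boundaries for N bins.
def equi_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    d = (hi - lo) / n_bins
    # bin i covers [lo + (i-1)*d, lo + i*d); the last bin is closed at hi
    return [(lo + i * d, lo + (i + 1) * d) for i in range(n_bins)]

def quantile_bins(values, n_bins):
    s = sorted(values)
    # split the sorted values into n_bins groups of (nearly) equal size,
    # roughly as NTILE would, and take the minimum of each group as a boundary
    size = len(s) / n_bins
    boundaries = [s[int(i * size)] for i in range(n_bins)] + [s[-1]]
    # collapse bins whose left and right boundaries are equal
    return [(boundaries[i], boundaries[i + 1])
            for i in range(n_bins) if boundaries[i] != boundaries[i + 1]]

incomes = [18, 22, 25, 27, 30, 31, 35, 40, 52, 75, 80, 120]
print(equi_width_bins(incomes, 4))
print(quantile_bins(incomes, 4))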

2.3.3 Normalization

Normalization converts individual numerical attributes so that each attribute's values lie in the same range. Values are converted to be in the range 0.0 to 1.0 or the range -1.0 to 1.0. Normalization ensures that attributes do not receive artificial weighting caused by differences in the ranges that they span. Some algorithms, such as k-Means, Support Vector Machine, and Non-Negative Matrix Factorization, benefit from normalization.
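As a simple illustration of min-max normalization to the range 0.0 to 1.0 (a sketch only, not the ODM transformation; the age and income values are hypothetical):

# Sketch: min-max normalization of a numerical attribute to [0.0, 1.0].
def normalize_min_max(values):
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [23, 35, 41, 58, 64]
incomes = [18000, 250000, 43000, 70000, 55000]
print(normalize_min_max(ages))         # both attributes now span the same range,
print(normalize_min_max(incomes))      # so neither dominates range-sensitive algorithms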

3 Supervised Data Mining

This chapter describes supervised models; supervised models are sometimes referred to as predictive models. These models predict a target value. The Java and PL/SQL Oracle Data Mining interfaces support the following supervised functions:
·         Classification
·         Regression
·         Attribute Importance
·         Anomaly Detection
This chapter also describes
·         Testing Supervised Models

3.1 Classification

Classification of a collection consists of dividing the items that make up the collection into categories or classes. In the context of data mining, classification is done using a model that is built on historical data. The goal of predictive classification is to accurately predict the target class for each record in new data, that is, data that is not in the historical data.
A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques for finding relations between the predictor attributes' values and the target attribute's values in the build data. These relations are summarized in a model; the model can then be applied to new cases with unknown target values to predict target values. A classification model can also be applied to data that was held aside from the training data to compare the predictions to the known target values; such data is also known as test data or evaluation data. The comparison technique is called testing a model, which measures the model's predictive accuracy. The application of a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying a model to data is often called scoring the data.
Classification is used in customer segmentation, business modeling, credit analysis, and many other applications. For example, a credit card company may wish to predict which customers are likely to default on their payments. Customers are divided into two classes: those who default and those who do not default. Each customer corresponds to a case; data for each case might consist of a number of attributes that describe the customer's spending habits, income, demographic attributes, etc. These are the predictor attributes. The target attribute indicates whether or not the customer has defaulted. The build data is used to build a model that predicts whether new customers are likely to default.
Classification problems can have either binary or multiclass targets. Binary targets are those that take on only two values, for example, good credit risk and poor credit risk. Multiclass targets have more than two values, for example, the product purchased (comb or hair brush or hair pin). Multiclass target values are not assumed to exist in an ordered relation to each other, for example, hair brush is not assumed to be greater or less than comb.
Classification problems may require the specification of costs (see "Costs") and priors (see "Priors").

3.1.1 Algorithms for Classification

ODM provides the following algorithms for classification:
·         Decision Tree Algorithm
·         Naive Bayes Algorithm
Table 3-1 compares several important features of the classification algorithms.
Table 3-1 Classification Algorithm Comparison
Feature                        Naive Bayes             Adaptive Bayes Network               Support Vector Machine        Decision Tree
Speed                          Very fast               Fast                                 Fast with active learning     Fast
Accuracy                       Good in many domains    Good in many domains                 Significant                   Good in many domains
Transparency                   No rules (black box)    Rules for Single Feature Build only  No rules (black box)          Rules
Missing value interpretation   Missing value           Missing value                        Sparse data                   Missing value

3.1.1.1 Decision Tree Algorithm

Decision tree rules provide model transparency so that a business user, marketing analyst, or business analyst can understand the basis of the model's predictions, and therefore, be comfortable acting on them and explaining them to others.
In addition to transparency, the Decision Tree algorithm provides speed and scalability. The build algorithm scales linearly with the number of predictor attributes and on the order of nlog(n) with the number of rows, n. Scoring is very fast. Both build and apply are parallelized. The Decision Tree algorithm builds models for binary and multi-class targets. It produces accurate and interpretable models with relatively little user intervention required. The Decision Tree algorithm is implemented in such a way as to handle data in the typical data table formats, to have reasonable defaults for splitting and termination criteria, to perform automatic pruning, and to perform automatic handling of missing values. However, it does not distinguish sparse data from missing data. (See "Sparse Data" for more information.) Users can specify costs and priors.
Decision Tree does not support nested tables.
Decision Tree Models can be converted to XML.
3.1.1.1.1 Decision Tree Rules
A Decision Tree model always produces rules. Decision tree rules are in the form "IF predictive information THEN target," as in "IF income is greater than $70K and household size is greater than 3 THEN the probability of Churn is 0.075."
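A decision tree rule of this form can be thought of as a simple predicate paired with a predicted probability. The following Python sketch applies the example rule above to a record; the attribute names and numbers come from that example, not from a real model.

# Sketch: applying the example rule
# "IF income > 70000 AND household_size > 3 THEN P(churn) = 0.075".
def churn_rule(record):
    if record["income"] > 70000 and record["household_size"] > 3:
        return 0.075          # probability predicted by this rule (leaf)
    return None               # rule does not fire; another leaf's rule applies

print(churn_rule({"income": 85000, "household_size": 4}))   # 0.075
print(churn_rule({"income": 40000, "household_size": 2}))   # None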
3.1.1.1.2 XML for Decision Tree Models
You can generate XML representing a decision tree model; the generated XML satisfies the definition specified in the Data Mining Group Predictive Model Markup Language (PMML) version 2.1 specification. The specification is available at http://www.dmg.org.

3.1.1.2 Naive Bayes Algorithm

The Naive Bayes algorithm (NB) can be used for both binary and multiclass classification problems.
NB builds and scores models extremely rapidly; it scales linearly in the number of predictors and rows.
NB makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states that the probability of event A occurring given that event B has occurred (P(A|B)) is proportional to the probability of event B occurring given that event A has occurred multiplied by the probability of event A occurring (P(B|A)P(A)).
Naive Bayes makes the assumption that each attribute is conditionally independent of the others, that is, given a particular value of the target, the distribution of each predictor is independent of the other predictors. In practice, this assumption of independence, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.
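The following Python sketch is a toy illustration of how Bayes' Theorem and the conditional independence assumption combine to score a single record; the priors and conditional probabilities are made-up values, and this is not the ODM implementation.

# Sketch: Naive Bayes scoring for a binary target ("buyer" vs "nonbuyer").
priors = {"buyer": 0.2, "nonbuyer": 0.8}                     # P(class)
likelihoods = {                                              # P(attribute value | class)
    "buyer":    {"income=high": 0.6, "region=urban": 0.7},
    "nonbuyer": {"income=high": 0.2, "region=urban": 0.4},
}

record = ["income=high", "region=urban"]

# P(class | record) is proportional to P(class) times the product of P(value | class).
scores = {}
for cls in priors:
    score = priors[cls]
    for value in record:
        score *= likelihoods[cls][value]
    scores[cls] = score

total = sum(scores.values())
for cls, score in scores.items():
    print(cls, round(score / total, 3))    # normalized posterior probabilities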

3.1.1.3 Adaptive Bayes Network Algorithm

Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)
ABN, in Single Feature Build mode, can describe the model in the form of human-understandable rules. The rules produced by ABN are one of its main advantages over Naive Bayes. ABN rules provide model transparency so that a business user, marketer, or business analyst can understand the basis of the model's predictions and therefore, be comfortable acting on them and explaining them to others.
In addition to rules, ABN provides performance and scalability, which are derived via various user parameters controlling the trade-off of accuracy and build time.
ABN predicts binary as well as multiclass targets.
ABN can use costs and priors for both building and scoring (see "Costs" and "Priors").
3.1.1.3.1 ABN Model Types
An ABN model is an adaptive conditional independence model that uses the minimum description length principle to construct and prune an array of conditionally independent network features. Each network feature consists of one or more conditional probability expressions. The collection of network features forms a product model that provides estimates of the target class probabilities. There can be one or more network features. The number and depth of the network features in the model determine the model mode. There are three model modes for ABN:
·         Pruned Naive Bayes (Naive Bayes Build)
·         Simplified decision tree (Single Feature Build)
·         Boosted (Multi Feature Build)
Users can select the ABN model type. Rules are available only for Single Feature Build.
Each network feature consists of one or more attributes included in a conditional probability expression. An array of single attribute network features is an MDL-pruned Naive Bayes model. A single multi-attribute network feature model is equivalent to a simplified C4.5 decision tree; such a model is simplified in the sense that numerical attributes are binned and treated as categorical. Furthermore, a single predictor is used to split all nodes at a given tree depth. The splits are k-way, where k is the number of unique (binned) values of the splitting predictor. Finally, a collection of multi-attribute network features forms a product model (boosted mode). All three types provide estimates of the target class probabilities.
3.1.1.3.2 ABN Rules
Rules can be extracted from an Adaptive Bayes Network model as compound predicates. Rules form a human-interpretable depiction of the model and include statistics indicating the number of the relevant training data instances in support of the rule. A record apply instance specifies a pathway in a network feature taking the form of a compound predicate.

Note:
Rules are generated for the single feature build model type only.

For example, suppose the feature consists of two training attributes: Age {20-40, 40-60, 60-80} and Income {<=50K, >50K}. A record instance consisting of a person age 25 and income $42K is expressed as
IF AGE IN (20-40) and INCOME IN (<=50K)
Suppose that the associated target (for example, response to a promotion) probabilities are {0.8 (no), 0.2 (yes)}. Then we have a detailed rule of the form
IF AGE IN (20-40) and INCOME IN (<=50K) THEN Probability = {0.8, 0.2}
In addition to the probability distribution, there are the associated training data counts, e.g. {400, 100}.
Suppose there is a cost matrix specifying that it is 6 times more costly to predict a no incorrectly than it is to predict a yes incorrectly. Then the cost of predicting yes for this instance is 0.8 * 1 = 0.8 (because the model is wrong in this prediction 80% of the time) and the cost of predicting no is 0.2 * 6 = 1.2. Thus, the minimum cost (best) prediction is yes. Without the cost matrix, the decision is reversed. Implicitly, all errors are equal and we have: 0.8 * 1 = 0.8 for yes and 0.2 * 1 = 0.2 for no.
The order of the predicates in the generated rules implies relative importance.
When you apply an ABN model for which rules were generated, with a single feature, you get the same result that you would get if you wrote an external program that applied the rules.

3.1.1.4 Support Vector Machine Algorithm

Support Vector Machine (SVM) is a state-of-the-art classification and regression algorithm. SVM is an algorithm with strong regularization properties, that is, the optimization procedure maximizes predictive accuracy while automatically avoiding over-fitting of the training data. Neural networks and radial basis functions, both popular data mining techniques, have the same functional form as SVM models; however, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM.
SVM projects the input data into a kernel space. Then it builds a linear model in this kernel space. A classification SVM model attempts to separate the target classes with the widest possible margin. A regression SVM model tries to find a continuous function such that maximum number of data points lie within an epsilon-wide tube around it. Different types of kernels and different kernel parameter choices can produce a variety of decision boundaries (classification) or function approximators (regression). The ODM SVM implementation supports two types of kernels: linear and Gaussian. ODM also provides automatic parameter estimation on the basis of the characteristics of the data.
SVM performs well with real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and biosequence analysis. The introduction of SVM in the early 1990s led to an explosion of applications and deepening theoretical analysis that established SVM along with neural networks as one of the standard tools for machine learning and data mining.
There is no upper limit on the number of attributes and target cardinality for SVMs; the only constraints are those imposed by hardware.
SVM is the preferred algorithm for sparse data.
The following new features have been added to the SVM algorithm in ODM 10g Release 2:
·         SVM can also be used to identify novel or anomalous patterns using one-class SVM. For more information, see "Anomaly Detection".
·         SVM supports active learning. For more information, see "Active Learning".
·         SVM automatically creates stratified samples for large training sets (see "Sampling for Classification") and automatically chooses a kernel type for model build (see "Automatic Kernel Selection").
3.1.1.4.1 Active Learning
SVM models grow as the size of the training data set increases. This property limits SVM models to small and medium size training sets (less than 100,000 cases). Active learning provides a way to deal with large training sets.
The termination criterion for active learning is usually an upper bound on the number of support vectors; when the upper bound is attained, the build stops. Alternatively, the stopping criterion can be qualitative, such as no significant improvement in model accuracy on a held-aside sample.
Active learning forces the SVM algorithm to restrict learning to the most informative training examples and not to attempt to use the entire body of data. In most cases, the resulting models have predictive accuracy comparable to that of the standard (exact) SVM model.
Active learning can be applied to all SVM models (classification, regression, and one-class).
Active learning is on by default. It can be turned off.
3.1.1.4.2 Sampling for Classification
For classification, SVM automatically performs stratified sampling during model build. The algorithm scans the entire build data set and selects a sample that is balanced across target values.
3.1.1.4.3 Automatic Kernel Selection
SVM automatically determines the appropriate kernel type based on build data characteristics. This selection can be overridden by explicitly specifying a kernel type.
3.1.1.4.4 Data Preparation and Settings Choice for Support Vector Machines
You can influence both the Support Vector Machine (SVM) model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings. Significant performance degradation can be caused by a poor choice of settings or inappropriate data preparation. Poor settings choices can also lead to inaccurate models.
For detailed information about data preparation for SVM models, see the Oracle Data Mining Application Developer's Guide.
SVM has built-in mechanisms that attempt to choose appropriate settings automatically based on the data provided. You may need to override the system-determined settings for some domains.

3.1.2 Data Preparation for Classification

This section summarizes data preparation that may be required by classification algorithms.

3.1.2.1 Outliers

Outliers affect classification algorithms as follows:
·         Naive Bayes and Adaptive Bayes Network: The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of these algorithms may be significantly reduced. In this case, quantile binning helps to overcome these problems.
·         Support Vector Machine: The presence of outliers can significantly impact models. Use a clipping transformation to avoid the problems caused by outliers.
·         Decision Tree: The presence of outliers does not impact decision tree models.

3.1.2.2 NULL Values

The meaning of NULL values and how to treat them depends on the algorithm as follows:
·         Support Vector Machine: NULL values indicate sparse data. Missing values are not automatically handled. If the data is not sparse and the values are indeed missing at random, it is necessary to perform missing data imputation (that is, perform some kind of missing values treatment) and substitute a non-NULL value for the NULL value. One simple approach is to use the mean for numerical attributes and the mode for categorical attributes (a simple imputation sketch follows this list). If you do not treat missing values, the algorithm will not handle the data correctly.
·         For all other classification algorithms, NULL values indicate missing values:
o        Decision Tree, Naive Bayes, and Adaptive Bayes Network: Missing values are handled automatically.
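Below is a small Python sketch of the simple mean/mode imputation suggested above for SVM input data; the data values are hypothetical and the code is not part of ODM.

# Sketch: substituting the mean (numerical) or mode (categorical) for NULLs.
from collections import Counter

def impute_mean(values):
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def impute_mode(values):
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_mean([30, None, 50, 40, None]))                 # NULLs become 40.0
print(impute_mode(["urban", None, "rural", "urban", None]))  # NULLs become "urban"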

3.1.2.3 Normalization

Support Vector Machine may benefit from normalization.

3.1.3 Costs

In a classification problem, it may be important to specify the costs involved in making an incorrect decision. Doing so can be useful when the costs of different misclassifications vary significantly.
For example, suppose the problem is to predict whether a user will respond to a promotional mailing. The target has two categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and that it costs $5 to do the mailing. If the model predicts YES and the actual value is YES, the cost of misclassification is $0. If the model predicts YES and the actual value is NO, the cost of misclassification is $5. If the model predicts NO and the actual value is YES, the cost of misclassification is $500. If the model predicts NO and the actual value is NO, the cost is $0. In this case, you would probably want to avoid cases where the model predicts NO and the actual value is YES.
Exactly how costs are specified depends on the classification algorithm used:
·         NB and ABN use a cost matrix
·         SVM uses weights
The cost of misclassification is summarized in a cost matrix. The rows of the matrix represent actual values and the columns, predicted values. A cell in the matrix represents the misclassification cost that occurs when the model predicts the class indicated by the column when the class is really the one specified by the row.
Weights for an SVM model are automatically initialized to achieve the best average prediction across all target values. If you change a weight value, the percentage of correct predictions changes in the same way; for example, if you increase a weight value, the percent of correct predictions increases for the associated class.
Classification algorithms apply the cost information to the predicted probabilities during scoring to estimate the least expensive prediction. If a cost matrix is specified for scoring, the output of the scoring is the minimum cost for the prediction. If no cost matrix is supplied, the output is the most likely prediction.
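A minimal Python sketch of cost-based scoring follows (illustrative only; ODM applies the cost matrix internally during scoring). It uses the mailing example above, where predicting NO for an actual YES costs $500 and predicting YES for an actual NO costs $5.

# Sketch: choosing the least expensive prediction from class probabilities
# and a cost matrix. cost[actual][predicted] is the misclassification cost.
cost = {
    "YES": {"YES": 0.0, "NO": 500.0},
    "NO":  {"YES": 5.0, "NO": 0.0},
}

def min_cost_prediction(probabilities):
    # expected cost of predicting each class, weighted by the class probabilities
    expected = {
        predicted: sum(probabilities[actual] * cost[actual][predicted]
                       for actual in probabilities)
        for predicted in cost
    }
    return min(expected, key=expected.get), expected

# A customer with only a 3% chance of responding:
prediction, expected = min_cost_prediction({"YES": 0.03, "NO": 0.97})
print(prediction, expected)   # YES is still cheaper: 0.97*5 = 4.85 vs 0.03*500 = 15.0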
You must be careful how you assign costs. You are making a trade-off between false-positives (falsely accusing someone of fraud) and false negatives (letting a crime go unpunished). Your costs should reflect this trade-off. Perhaps you are willing to let some crimes go unpunished so that you don't falsely accuse millions of committing fraud; for example, you must be sure that you are right before you accuse someone (say 99%, rather than just 50% sure). Predicting on probability means you are indifferent to the type of error you make. If you are concerned about the type of error, a cost matrix or carefully adjusted weights are warranted.

3.1.4 Priors

In building a classification model, describing the distribution in the real population can be useful when the training data does not accurately reflect the real population. The real population is described by providing the prior distribution, often referred to as the priors, to the build operation.
In many problems with a binary target, one target value dominates in frequency. For example, the positive responses for a telephone marketing campaign may be 2% or less, and the occurrence of fraud in credit card transactions may be less than 1%. A classification model built on historic data of this type may not observe enough positive cases to be able to distinguish the characteristics of the two classes; the result could be a model that, when applied to new data, predicts the negative class for every case. While such a model may be highly accurate, it may not be very useful. This illustrates that it is not a good idea to rely solely on accuracy when judging a model.
One solution to this problem involves creating a source table for the build operation that contains approximately equal numbers of each target value. However, the algorithm will take the observed distribution as realistic, and will build a model that predicts each of the target values in equal numbers unless it is instructed otherwise. Supplying the actual distribution of target values, the priors, to the build process can result in a more effective model.
Note that the model should be tested against data that has the actual distribution of target values, for example, 98% negative and 2% positive for the marketing campaign.

3.2 Regression

Regression models are similar to classification models. The difference between regression and classification is that regression deals with numerical or continuous target attributes, whereas classification deals with discrete or categorical target attributes. In other words, if the target attribute contains continuous (floating-point) values or integer values that have inherent order, a regression technique can be used. If the target attribute contains categorical values, that is, string or integer values where order has no significance, a classification technique is called for. Note that a continuous target can be turned into a discrete target by binning; this turns a regression problem into a problem that can be solved using classification algorithms.

3.2.1 Algorithm for Regression

Support Vector Machine (SVM) builds both classification and regression models. For more information about SVM, see "Support Vector Machine Algorithm".
ODM SVM provides improved data-driven estimation of epsilon and the complexity factor for SVM regression models.
Active learning (see "Active Learning") can be used for regression models.
One-class SVM (see "Anomaly Detection") cannot be used for regression problems.

3.3 Attribute Importance

Attribute Importance (AI) provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.
The time required to build ODM classification models increases with the number of attributes. Attribute Importance identifies a proper subset of the attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes only.
Using fewer attributes does not necessarily result in lost predictive accuracy. Using too many attributes (especially those that are "noise") can affect the model and degrade its performance and accuracy. Mining using the smallest number of attributes can save significant computing time and may build better models.
The programming interfaces for Attribute Importance permit the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point.

3.3.1 Data Preparation for Attribute Importance

The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the discriminating power of an attribute importance model may be significantly reduced. In this case, quantile binning helps to overcome these problems.
NULL values are treated as missing values and not as indicators of sparse data.

3.3.2 Algorithm for Attribute Importance

ODM uses the Minimum Description Length algorithm for attribute importance.

3.3.2.1 Minimum Description Length Algorithm

Minimum Description Length (MDL) is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of the data. The MDL principle is used to build ODM Attribute Importance models.
MDL considers each attribute as a simple predictive model of the target class. These single predictor models are compared and ranked with respect to the MDL metric (compression in bits). MDL penalizes model complexity to avoid over-fit. It is a principled approach that takes into account the complexity of the predictors (as models) to make the comparisons fair.
With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data.
Attribute importance uses a two-part code to transmit the data. The first part (preamble) transmits the model. The parameters of the model are the target probabilities associated with each value of the prediction. For a target with j values and a predictor with k values, ni (i = 1,..., k) rows per value, there are Ci, the combination of j-1 things taken ni-1 at a time, possible conditional probabilities. The size of the preamble in bits can be shown to be Sum(log2(Ci)), where the sum is taken over k. Computations like this represent the penalties associated with each single prediction model. The second part of the code transmits the target values using the model.
It is well known that the most compact encoding of a sequence is the encoding that best matches the probability of the symbols (target class values). Thus, the model that assigns the highest probability to the sequence has the smallest target class value transmission cost. In bits, this cost is -Sum(log2(pi)), where the pi are the predicted probabilities of the actual target values for rows i under the model.
The predictor rank is the position in the list of associated description lengths, smallest first.

3.4 Anomaly Detection

Anomaly detection consists of identifying novel or anomalous patterns. Identifying such patterns can be useful in problems of fraud detection (insurance, tax, credit card, etc.) and computer network intrusion detection. An anomaly detection model predicts whether a data point is typical for a given distribution or not. An atypical data point can be either an outlier or an example of a previously unseen class.
An anomaly detection model discriminates between the known examples of the positive class and the unknown negative set of counterexamples. An anomaly detection model identifies items that do not fit in the distribution.
Anomaly detection is a mining function in the Oracle Data Miner interface. In the ODM Java and PL/SQL interfaces, an anomaly detection model is a classification model. See "Specify the One-Class SVM Algorithm" for more information.

3.4.1 Algorithm for Anomaly Detection

Standard binary supervised classification algorithms, such as Naive Bayes, require the presence of both positive examples and negative examples (counterexamples) of a target class. One-class SVM classification requires only the presence of examples of a single target class. The model learns to discriminate between the known examples of the positive class and the unknown negative set of counterexamples. In other words, one-class SVM detects anomalies. One-class SVM was initially used to estimate the support of a distribution. The goal is to estimate a function that will be positive if an example belongs to a set and negative if the example belongs to the complement of the set. The model computes a binary function that identifies regions in the input space where the majority of the positive data lives.
One-class SVM models are useful in situations such as:
·         Outlier detection
·         Cases where it is difficult to provide counterexamples
In outlier detection, you separate the typical examples in a distribution from the atypical (outlier) examples. The distance from the separating plane indicates how typical a given point is with respect to the distribution of the training data. Outliers can be ranked on the basis of the probability of them being typical or atypical cases. In a similar way, you can use one-class SVM to detect other kinds of anomalies.
In some classes of problems, it is difficult or impossible to provide a useful and representative set of counterexamples. Examples may be easy to identify, but counterexamples are either hard to specify or expensive to collect. For example, in text document classification, it is easy to classify a document under a given topic. However, the universe of documents not belonging to this topic can be very large and it may not be feasible to provide counterexamples.
The accuracy of one-class SVM models cannot usually match the accuracy of standard SVM classifiers built with meaningful counterexamples.
SVM is the only supervised classification algorithm in ODM that can operate in one-class mode.
One-class SVM cannot be used for regression problems.

3.4.1.1 Specify the One-Class SVM Algorithm

To build a one-class SVM model in either of the ODM programmatic interfaces, select classification as the mining function, SVM as the algorithm, and pass a NULL or empty string as the target column name.
To build a one-class SVM model in Oracle Data Miner, select anomaly detection.
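The one-class SVM idea can also be illustrated outside the database with scikit-learn's one-class SVM. This is only a sketch of the concept, not the ODM interface, and it assumes scikit-learn and NumPy are installed.

# Sketch: one-class SVM for anomaly detection using scikit-learn (not ODM).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # "typical" training cases only
model = OneClassSVM(kernel="rbf", nu=0.05).fit(normal)    # nu bounds the outlier fraction

new_cases = np.array([[0.1, -0.2],     # looks typical
                      [6.0, 6.0]])     # far from the training distribution
print(model.predict(new_cases))        # +1 = typical, -1 = anomaly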

3.5 Testing Supervised Models

Supervised models are tested to evaluate the accuracy of their predictions. A model is tested by applying it to data with known values of the target and comparing the model's predicted values with the known values. The test data must be compatible with the data used to build the model and must be prepared in the same way that the build data was.
Model testing results in the calculation of test metrics. The exact test metrics calculated depend on the type of model. Testing a classification model results in a confusion matrix; testing a regression model results in error estimates. Optionally, lift and receiver operating characteristics (ROC) can be calculated for classification models.
The rest of this section describes these test metrics, as follows:
·         Confusion Matrix
·         Lift

3.5.1 Confusion Matrix

ODM supports the calculation of a confusion matrix to assess the accuracy of a classification model. The simplest example of a confusion matrix is one for a binary classification problem. For a binary problem, the confusion matrix is a two-dimensional square matrix. (In general, the confusion matrix is an n-dimensional square matrix, where n is the number of distinct target values.) The row indexes of a confusion matrix correspond to actual values observed and used for model testing; the column indexes correspond to predicted values produced by applying the model to the test data. For any pair of actual/predicted indexes, the value indicates the number of records classified in that pairing. Figure 3-1 contains an example of a confusion matrix. For example, a value of 25 for an actual value index of "buyer" and a predicted value index of "nonbuyer" indicates that the model incorrectly classified a "buyer" as a "nonbuyer" 25 times. A value of 516 for an actual/predicted value index of "buyer" indicates that the model correctly classified a "buyer" 516 times.
The predictions were correct 516 + 725 = 1241 times, and incorrect 25 + 10 = 35 times. The sum of the values in the matrix is equal to the number of scored records in the input data table. The number of scored records is the sum of correct and incorrect predictions, which is 1241 + 35 = 1276. The error rate is 35/1276 = 0.0274; the accuracy rate is 1241/1276 = 0.9725.
A confusion matrix provides a quick understanding of model accuracy and the types of errors the model makes when scoring records. It is the result of a test task for classification models.
Figure 3-1 Confusion Matrix (figure not reproduced here)
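The arithmetic above can be reproduced with a short Python sketch. The counts are the buyer/nonbuyer values from the example; treating the 10 remaining errors as actual nonbuyers predicted as buyers is an assumption, since the figure is not shown.

# Sketch: accuracy and error rate from a 2x2 confusion matrix.
# confusion[actual][predicted] = number of test records in that cell.
confusion = {
    "buyer":    {"buyer": 516, "nonbuyer": 25},
    "nonbuyer": {"buyer": 10,  "nonbuyer": 725},
}

correct = sum(confusion[c][c] for c in confusion)
total = sum(sum(row.values()) for row in confusion.values())
print("correct:", correct, "total:", total)        # 1241 of 1276
print("accuracy:", round(correct / total, 4))      # about 0.97, as in the example above
print("error rate:", round(1 - correct / total, 4))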

3.5.2 Lift

ODM supports computing lift for a classification model. Lift can be computed for binary (two values) target fields. Lift can also be computed for multiclass targets by designating a preferred positive class and combining all other target class values, effectively turning a multiclass target into a binary target. Given a designated positive target value (that is, the value of most interest for prediction, such as "buyer," or "has disease"), test cases are sorted according to how confidently they are predicted to be positive cases. Positive cases with highest confidence come first, followed by positive cases with lower confidence. Negative cases with lowest confidence come next, followed by negative cases with highest confidence. Based on that ordering, they are partitioned into quantiles, and the following statistics are calculated by the programming interfaces:
·         Probability threshold for a quantile n is the minimum probability for the positive target to be included in this quantile or any preceding quantiles (quantiles n-1, n-2,..., 1). If a cost matrix is used, a cost threshold is reported instead. The cost threshold is the maximum cost for the positive target to be included in this quantile or any of the preceding quantiles.
·         Cumulative gain for a given quantile is the ratio of the cumulative number of positive targets to the total number of positive targets.
·         Target density of a quantile is the number of true positive instances in that quantile divided by the total number of instances in the quantile.
·         Cumulative target density for quantile n is the target density computed over the first n quantiles.
·         Quantile lift is the ratio of target density for the quantile to the target density over all the test data.
·         Cumulative percentage of records for a given quantile is the percentage of all test cases represented by the first n quantiles, starting at the end that is most confidently positive, up to and including the given quantile.
·         Cumulative number of targets for quantile n is the number of true positive instances in the first n quantiles (defined as above).
·         Cumulative number of nontargets is the number of actually negative instances in the first n quantiles (defined as above).
·         Cumulative lift for a given quantile is the ratio of the cumulative target density to the target density over all the test data.
Cumulative targets can be computed from the quantities that are available in the LiftResultElement using the following formula:
targets_cumulative = lift_cumulative * percentage_records_cumulative
Oracle Data Miner calculates different statistics for lift. See the online help for Oracle Data Miner for more information.
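As a rough plain-SQL illustration of the quantile statistics listed above (not the ODM implementation), the sketch below computes target density and lift per decile from a hypothetical SCORED_TEST table with an ACTUAL column and a PROB_POSITIVE column holding the predicted probability of the positive class 'buyer'.

WITH ranked AS (
  SELECT actual,
         NTILE(10) OVER (ORDER BY prob_positive DESC) AS quantile
    FROM scored_test
)
SELECT quantile,
       AVG(CASE WHEN actual = 'buyer' THEN 1 ELSE 0 END) AS target_density,
       AVG(CASE WHEN actual = 'buyer' THEN 1 ELSE 0 END) /
         (SELECT AVG(CASE WHEN actual = 'buyer' THEN 1 ELSE 0 END) FROM scored_test)
         AS quantile_lift
  FROM ranked
 GROUP BY quantile
 ORDER BY quantile;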

3.5.3 Receiver Operating Characteristics

Another useful method for evaluating classification models is Receiver Operating Characteristics (ROC) analysis. ROC curves are similar to lift charts in that they provide a means of comparing individual models and of determining thresholds that yield a high proportion of positive hits. ROC was originally used in signal detection theory to gauge the true hit versus false alarm ratio when sending signals over a noisy channel.
The horizontal axis of an ROC graph measures the false positive rate as a percentage. The vertical axis shows the true positive rate. The top left hand corner is the optimal location in an ROC curve, indicating high TP (true-positive) rate versus low FP (false-positive) rate. The area under the ROC curve (AUC) measures the discriminating ability of a binary classification model. The larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case. The AUC measure is especially useful for data sets with unbalanced target distribution (one target class dominates the other).
In the example graph in Figure 3-2, Model A clearly has a higher AUC for the entire data set. However, if the user decides that a false positive rate of 40% is acceptable, Model B is better suited, since it achieves a better true positive rate at that false positive rate.
Figure 3-2 Receiver Operating Characteristics Curves
Besides model selection, ROC analysis also helps to determine a threshold value that achieves an acceptable trade-off between the hit (true positive) rate and the false alarm (false positive) rate. Selecting a point on the curve for a given model fixes a particular trade-off. This threshold can then be used as a post-processing parameter for achieving the desired performance with respect to the error rates. ODM models use a threshold of 0.5 by default.
The Oracle Data Mining ROC computation calculates the following statistics:
·         Probability threshold: The minimum predicted positive class probability resulting in a positive class prediction. Different threshold values result in different hit rates and false alarm rates.
·         True negatives: Negative cases in the test data with predicted probabilities strictly less than the probability threshold (correctly predicted).
·         True positives: Positive cases in the test data with predicted probabilities greater than or equal to the probability threshold (correctly predicted).
·         False negatives: Positive cases in the test data with predicted probabilities strictly less than the probability threshold (incorrectly predicted).
·         False positives: Negative cases in the test data with predicted probabilities greater than or equal to the probability threshold (incorrectly predicted).
·         Hit rate ("True Positives" in Oracle Data Miner): (true positives/(true positives + false negatives))
·         False alarm rate ("False Positives" in Oracle Data Miner): (false positives/(false positives + true negatives))
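These counts reduce to simple aggregates over scored test data. The sketch below assumes a hypothetical SCORED_TEST table (ACTUAL, PROB_POSITIVE), a positive class of 'buyer', and a probability threshold of 0.5; it merely restates the definitions above in SQL.

SELECT SUM(CASE WHEN actual =  'buyer' AND prob_positive >= 0.5 THEN 1 ELSE 0 END) AS true_positives,
       SUM(CASE WHEN actual <> 'buyer' AND prob_positive <  0.5 THEN 1 ELSE 0 END) AS true_negatives,
       SUM(CASE WHEN actual =  'buyer' AND prob_positive <  0.5 THEN 1 ELSE 0 END) AS false_negatives,
       SUM(CASE WHEN actual <> 'buyer' AND prob_positive >= 0.5 THEN 1 ELSE 0 END) AS false_positives
  FROM scored_test;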

3.5.4 Test Metrics for Regression Models

Regression test results provide the following measures of model accuracy:
·         Root mean square error (RMSE)
·         Mean absolute error (MAE)
These two statistics are the metrics most commonly used to test regression models. For more information about these metrics, see the Oracle Data Mining Application Developer's Guide.
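For reference, these are the standard definitions (with y_i the actual target value, \hat{y}_i the predicted value, and n the number of scored records):

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|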

4 Unsupervised Data Mining

This chapter describes unsupervised models. These models do not predict a target value, but focus on the intrinsic structure, relations, and interconnectedness of the data. Unsupervised models are sometimes called descriptive models. Oracle Data Mining supports the following unsupervised functions:
·         Clustering
·         Association
·         Feature Extraction

4.1 Clustering

Clustering is useful for exploring data. If there are many cases and no obvious natural groupings, clustering data mining algorithms can be used to find natural groupings.
Clustering analysis identifies clusters embedded in the data. A cluster is a collection of data objects that are similar in some sense to one another. A good clustering method produces high-quality clusters to ensure that the inter-cluster similarity is low and the intra-cluster similarity is high; in other words, members of a cluster are more like each other than they are like members of a different cluster.
Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models. Clustering models are different from supervised models in that the outcome of the process is not guided by a known result, that is, there is no target attribute. Supervised models predict values for a target attribute, and an error rate between the target and predicted values can be calculated to guide model building. Clustering models, on the other hand, are built using optimization criteria that favor high intra-cluster and low inter-cluster similarity. The model can then be used to assign cluster identifiers to data points.
In ODM a cluster is characterized by its centroid, attribute histograms, and the cluster's place in the model's hierarchical tree. ODM performs hierarchical clustering using an enhanced version of the k-means algorithm and the orthogonal partitioning clustering algorithm, an Oracle proprietary algorithm. The clusters discovered by these algorithms are then used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the bounding boxes that envelop the data in the clusters discovered by the clustering algorithm. The antecedent of each rule describes the clustering bounding box. The consequent encodes the cluster ID for the cluster described by the rule. For example, for a data set with two attributes: AGE and HEIGHT, the following rule represents most of the data assigned to cluster 10:
If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
The clusters are also used to generate a Bayesian probability model which is used during scoring for assigning data points to clusters.
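For example, once a clustering model has been built, cluster assignments and their probabilities can be obtained with the SQL scoring functions described later in this guide. The query below is a minimal sketch that assumes a hypothetical model KM_MODEL and a CUSTOMERS table.

SELECT cust_id,
       CLUSTER_ID(km_model USING *)          AS cluster_id,
       CLUSTER_PROBABILITY(km_model USING *) AS cluster_probability
  FROM customers;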

4.1.1 Algorithms for Clustering

Oracle Data Mining supports the following algorithms for clustering:
·         Enhanced k-Means Algorithm
·         Orthogonal Partitioning Clustering (O-Cluster) Algorithm
For a brief comparison of the two clustering algorithms, see Table 4-1.

4.1.1.1 Enhanced k-Means Algorithm

The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters (provided there are enough distinct cases). Distance-based algorithms rely on a distance metric (function) to measure the similarity between data points. The distance metric is either Euclidean, Cosine, or Fast Cosine distance. Data points are assigned to the nearest cluster according to the distance metric used.
ODM implements an enhanced version of the k-means algorithm with the following features:
·         The algorithm builds models in a hierarchical manner. The algorithm builds a model top down using binary splits and refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.
·         The algorithm grows the tree one node at a time (unbalanced approach). Based on a user setting available in either of the programming interfaces, the node with the largest variance is split to increase the size of the tree until the desired number of clusters is reached.
·         The algorithm provides probabilistic scoring and assignment of data to clusters.
·         The algorithm returns, for each cluster, a centroid (cluster prototype), histograms (one for each attribute), and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes or the mean and variance for numerical attributes.
This approach to k-means avoids the need for building multiple k-means models and provides clustering results that are consistently superior to the traditional k-means.
4.1.1.1.1 Data for k-Means
The ODM implementation of k-Means supports both categorical and numerical data.
4.1.1.1.2 Data Preparation for k-Means
For numerical attributes, data normalization is recommended.
For the k-Means algorithm, NULL values indicate sparse data. Missing values are not automatically handled. If the data is not sparse and the values are indeed missing at random, you should perform missing data imputation (that is, perform some kind of missing values treatment) and substitute a non-NULL value for the NULL value. One simple way to treat missing values is to use the mean for numerical attributes and the mode for categorical attributes. If you do not treat missing values, the algorithm will not handle the data correctly.
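A minimal SQL sketch of such a treatment, assuming a hypothetical CUSTOMERS table with a numerical AGE column and a categorical REGION column (the mean and mode would normally be computed from the build data and reused for test and apply data):

SELECT cust_id,
       NVL(age,    (SELECT AVG(age)           FROM customers)) AS age,
       NVL(region, (SELECT STATS_MODE(region) FROM customers)) AS region
  FROM customers;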
4.1.1.1.3 Scoring (Applying Models)
The clusters discovered by enhanced k-Means are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The k-means algorithm can be interpreted as a mixture model where the mixture components are spherical multivariate normal distributions with the same variance for all components.

4.1.1.2 Orthogonal Partitioning Clustering (O-Cluster) Algorithm

The O-Cluster algorithm supports Orthogonal Partitioning Clustering; O-Cluster creates a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space. The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. A parameter called sensitivity defines a baseline density level. Only areas with peak density above this baseline level can be identified as clusters.
The k-means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-Means tessellates it into n clusters (where n is specified by the user). O-Cluster separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.
4.1.1.2.1 O-Cluster Data Use
O-Cluster does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It will only read another batch if it believes, based on statistical tests, that there may still exist clusters that it has not yet uncovered.
Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.
4.1.1.2.2 Binning for O-Cluster
The use of ODM's equi-width binning transformation with automated estimation of the required number of bins is highly recommended.
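As a plain-SQL illustration of equi-width binning (the ODM transformation package provides its own procedures for this), the standard WIDTH_BUCKET function divides the observed range of a hypothetical INCOME column into 10 equal-width bins:

SELECT cust_id,
       WIDTH_BUCKET(income,
                    (SELECT MIN(income) FROM customers),
                    (SELECT MAX(income) FROM customers),
                    10) AS income_bin   -- values equal to the maximum fall into overflow bucket 11
  FROM customers;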
4.1.1.2.3 O-Cluster Attribute Type
Binary attributes should be declared as categorical.
4.1.1.2.4 O-Cluster Scoring
The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring (model apply) for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numerical attributes and multinomial distributions for categorical attributes.

4.1.1.3 Outliers and Clustering

The presence of outliers can significantly impact either type of clustering model. Use a clipping transformation before you bin or normalize the table to avoid the problems caused by outliers.
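A minimal plain-SQL sketch of clipping (winsorizing), assuming a hypothetical CUSTOMERS table with an INCOME column and clipping at the 5th and 95th percentiles:

SELECT cust_id,
       GREATEST(LEAST(income, pct95), pct05) AS income_clipped
  FROM (SELECT cust_id,
               income,
               PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY income) OVER () AS pct05,
               PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY income) OVER () AS pct95
          FROM customers);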

4.1.1.4 K-Means and O-Cluster Comparison

The main characteristics of the enhanced k-means and O-Cluster algorithms are summarized in Table 4-1.
Table 4-1 Clustering Algorithms Compared
Feature | Enhanced k-Means | O-Cluster
Clustering methodology | Distance-based | Grid-based
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases; handles large tables via active sampling
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes
Number of clusters | User-specified | Automatically determined
Hierarchical clustering | Yes | Yes
Probabilistic cluster assignment | Yes | Yes
Recommended data preparation | Normalization | Equi-width binning after clipping

4.2 Association

An Association model is often used for market basket analysis, which attempts to discover relationships or correlations in a set of items. Market basket analysis is widely used in data analysis for direct marketing, catalog design, and other business decision-making processes. A typical association rule of this kind asserts that, for example, "70% of the people who buy spaghetti, wine, and sauce also buy garlic bread."
Association models capture the co-occurrence of items or events in large volumes of customer transaction data. Because of progress in bar-code technology, it is now possible for retail organizations to collect and store massive amounts of sales data. Association models were initially defined for such sales data, even though they are applicable in several other applications. Finding association rules is valuable for cross-marketing and mail-order promotions, but there are other applications as well: catalog design, add-on sales, store layout, customer segmentation, web page personalization, and target marketing.
Traditionally, association models are used to discover business trends by analyzing customer transactions. However, they can also be used effectively to predict Web page accesses for personalization. For example, assume that after mining the Web access log, Company X discovered an association rule "A and B implies C," with 80% confidence, where A, B, and C are Web page accesses. If a user has visited pages A and B, there is an 80% chance that he/she will visit page C in the same session. Page C may or may not have a direct link from A or B. This information can be used to create a dynamic link to page C from pages A or B so that the user can "click-through" to page C directly. This kind of information is particularly valuable for a Web server supporting an e-commerce site to link the different product pages dynamically, based on the customer interaction.
There are several properties of association models that can be calculated. ODM calculates the following two properties related to rules:
·         Support: Support of a rule is a measure of how frequently the items involved in it occur together. Using probability notation, support (A implies B) = P(A, B).
·         Confidence: Confidence of a rule is the conditional probability of B given A; confidence (A implies B) = P (B given A).
These statistical measures can be used to rank the rules and hence the predictions.
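A worked plain-SQL sketch of the two measures, assuming a hypothetical BASKETS table (TX_ID, ITEM) and the single-antecedent rule "spaghetti implies garlic bread":

WITH a AS (SELECT DISTINCT tx_id FROM baskets WHERE item = 'spaghetti'),
     b AS (SELECT DISTINCT tx_id FROM baskets WHERE item = 'garlic bread')
SELECT (SELECT COUNT(*) FROM a JOIN b ON a.tx_id = b.tx_id) /
       (SELECT COUNT(DISTINCT tx_id) FROM baskets)           AS support,    -- P(A, B)
       (SELECT COUNT(*) FROM a JOIN b ON a.tx_id = b.tx_id) /
       NULLIF((SELECT COUNT(*) FROM a), 0)                   AS confidence  -- P(B given A)
  FROM dual;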
ODM uses the Apriori algorithm for association models.

4.2.1 Data for Association Models

Association models are designed to use sparse data. Sparse data is data for which only a small fraction of the attributes are non-zero or non-null in any given row. Examples of sparse data include market basket and text mining data. For example, in a market basket problem, there might be 1,000 products in the company's catalog, and the average size of a basket (the collection of items that a customer purchases in a typical transaction) might be 20 products. In this example, a transaction/case/record has on average 20 out of 1000 attributes that are not null. This implies that the fraction of non-zero attributes on the table (or the density) is 20/1000, or 2%. This density is typical for market basket and text processing problems. Data that has a significantly higher density can require extremely large amounts of temporary space to build associations.
Association models treat NULL values as sparse data. The algorithm does not handle missing values. If the data is not sparse and the NULL values are indeed missing at random, you should perform missing data imputation (that is, treat the missing values) and substitute non-null values for the NULL value.
The presence of outliers, when external equal-width binning is used, makes most of the data concentrate in a few bins (a single bin in extreme cases). As a result, the ability of the model to detect differences in numerical attributes may be significantly lessened. For example, a numerical attribute such as income may have all the data belonging to a single bin except for one entry (the outlier) that belongs to a different bin. As a result, there won't be any rules reflecting different levels of income. All rules containing income will only reflect the range in the single bin; this range is basically the income range for the whole population. In cases like this, use a clipping transformation to handle outliers.

4.2.2 Difficult Cases for Associations

The Apriori algorithm works by iteratively enumerating item sets of increasing lengths subject to the minimum support threshold. Since state-of-the-art algorithms for associations work by iterative enumeration, association rules algorithms do not handle the following cases efficiently:
·         Finding associations involving rare events
·         Finding associations in data sets that are dense and that have a large number of attributes.

4.2.2.1 Finding Associations Involving Rare Events

Association mining discovers patterns with frequency above the minimum support threshold. Therefore, in order to find associations involving rare events, the algorithm must run with very low minimum support values. However, doing so can cause the number of enumerated item sets to explode, especially in cases with a large number of items, which can increase the execution time significantly.
Therefore, association rule mining is not recommended for finding associations involving rare events in problem domains with a large number of items.
One option is to use classification models in such problem domains.

4.2.3 Algorithm for Associations

Associations are calculated using the Apriori algorithm. The association mining problem can be decomposed into two subproblems:
·         Find all combinations of items, called frequent itemsets, whose support is greater than the specified minimum support.
·         Use the frequent itemsets to generate the desired rules. ODM Association supports single consequent rules only (for example, "ABC implies D").
The number of frequent itemsets is controlled by the minimum support parameters. The number of rules generated is controlled by the number of frequent itemsets and the confidence parameter. If the confidence parameter is set too high, there may be frequent itemsets in the association model but no rules.
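A minimal PL/SQL sketch of building such a model through DBMS_DATA_MINING, assuming a hypothetical MARKET_BASKET input table keyed by TX_ID; ASSO_MIN_SUPPORT and ASSO_MIN_CONFIDENCE are the relevant setting names, but treat the script as an outline rather than a tested example.

-- Settings table holding the two Apriori thresholds.
CREATE TABLE asso_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));
INSERT INTO asso_settings VALUES ('ASSO_MIN_SUPPORT',    '0.05');
INSERT INTO asso_settings VALUES ('ASSO_MIN_CONFIDENCE', '0.10');

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'MB_ASSO_MODEL',
    mining_function     => DBMS_DATA_MINING.ASSOCIATION,
    data_table_name     => 'MARKET_BASKET',
    case_id_column_name => 'TX_ID',
    settings_table_name => 'ASSO_SETTINGS');
END;
/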

4.3 Feature Extraction

Feature Extraction creates a set of features based on the original data. A feature is a combination of attributes that is of special interest and captures important characteristics of the data. It becomes a new attribute. Typically, there are far fewer features than there are original attributes.
Some applications of feature extraction are latent semantic analysis, data compression, data decomposition and projection, and pattern recognition. Feature extraction can also be used to enhance the speed and effectiveness of supervised learning.
For example, feature extraction can be used to extract the themes of a document collection, where documents are represented by a set of key words and their frequencies. Each theme (feature) is represented by a combination of keywords. The documents in the collection can then be expressed in terms of the discovered themes.

4.3.1 Algorithm for Feature Extraction

ODM uses the Non-Negative Matrix Factorization (NMF) algorithm for feature extraction. NMF is described in the paper "Learning the Parts of Objects by Non-Negative Matrix Factorization" by D. D. Lee and H. S. Seung in Nature (401, pages 788-791, 1999). Non-Negative Matrix Factorization is a feature extraction algorithm that decomposes multivariate data by creating a user-defined number of features, which results in a reduced representation of the original data. NMF decomposes a data matrix V into the product of two lower rank matrices W and H so that V is approximately equal to W times H. NMF uses an iterative procedure to modify the initial values of W and H so that the product approaches V. The procedure terminates when the approximation error converges or the specified number of iterations is reached. Each feature is a linear combination of the original attribute set; the coefficients of these linear combinations are non-negative. During model apply, an NMF model maps the original data into the new set of attributes (features) discovered by the model.
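In symbols (the standard NMF formulation; the orientation of V as attributes by cases is a notational convention here, not something this guide fixes):

V \approx W H, \qquad V \in \mathbb{R}_{\ge 0}^{m \times n},\quad W \in \mathbb{R}_{\ge 0}^{m \times k},\quad H \in \mathbb{R}_{\ge 0}^{k \times n},\quad k \ll \min(m, n)

where W and H are updated iteratively to minimize the reconstruction error \lVert V - W H \rVert.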

4.3.1.1 NMF for Text Mining

Text mining involves extracting information from unstructured data. Typically, text data is high-dimensional and sparse. Unsupervised algorithms like Principal Components Analysis (PCA), Singular Value Decomposition (SVD), and NMF involve factoring the document-term matrix based on different constraints. One widely used approach for text mining is latent semantic analysis. NMF focuses on reducing dimensionality. By comparing the vectors for two adjoining segments of text in a high-dimensional semantic space, NMF provides a characterization of the degree of semantic relatedness between the segments. NMF is less complex than PCA and can be applied to sparse data. NMF-based latent semantic analysis is an attractive alternative to SVD approaches due to the additive non-negative nature of the solution and the reduced computational complexity and resource requirements.

4.3.1.2 Data Preparation for NMF

The presence of outliers can significantly impact NMF models. Use a clipping transformation before you bin or normalize the table to avoid the problems caused by outliers for these algorithms.
NULL values are treated as missing and not as indicators of sparse data.
NMF may benefit from normalization.

5 Data Mining Process

This chapter describes the data mining process in general and how it is supported by Oracle Data Mining. Data mining requires data preparation, model building, model testing and computing lift for a model, model applying (scoring), and model deployment. The Oracle database and Oracle Data Mining provide facilities for performing all of the data mining steps. This chapter discusses the following topics:
·         How Is Data Mining Done?

5.1 How Is Data Mining Done?

CRISP-DM is a widely accepted methodology for data mining projects. For details, see http://www.crisp-dm.org. The steps in the process are:
1.      Business Understanding: Understand the project objectives and requirements from a business perspective, and then convert this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.
2.      Data Understanding: Start by collecting data, then get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses about hidden information.
3.      Data Preparation: Includes all activities required to construct the final data set (data that will be fed into the modeling tool) from the initial raw data. Tasks include table, case, and attribute selection as well as transformation and cleaning of data for modeling tools.
4.      Modeling: Select and apply a variety of modeling techniques, and calibrate tool parameters to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
5.      Evaluation: Thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. Determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results is reached.
6.      Deployment: Organize and present the results of data mining. Deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process.
Data mining is iterative. A data mining process continues after a solution is deployed. The lessons learned during the process can trigger new business questions. Changing data can require new models. Subsequent data mining processes benefit from the experiences of previous ones.
Oracle Data Mining (ODM) supports the last three steps of the CRISP-DM process (modeling, evaluation, and deployment). The first step, Business Understanding, is unique to your business. The remaining steps are supported by a combination of ODM and the Oracle database, especially in the context of an Oracle data warehouse. The facilities of the Oracle database can be very useful during data understanding and data preparation.

5.2 How Does Oracle Data Mining Support Data Mining?

ODM integrates data mining with the Oracle database and exposes data mining through the following interfaces:
·         Java interface: Java Data Mining (JSR-73) compliant interface that allows users to embed data mining in Java applications.
·         PL/SQL interface: The packages DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM allow users to embed data mining in PL/SQL applications.
·         Automated data mining: The DBMS_PREDICTIVE_ANALYTICS PL/SQL package, described briefly in "Automated Data Mining", automates the entire data mining process from data preprocessing through model building to scoring data.
·         Data mining SQL functions: The SQL Data Mining functions (CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET, FEATURE_ID, FEATURE_SET, FEATURE_VALUE, PREDICTION, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_PROBABILITY, and PREDICTION_SET) support deployment of models within the context of existing applications, improve scoring performance, and enable pipelining of results involving data mining predictions. For more information, see "Data Mining Functions".
·         Graphical interfaces: Oracle Data Miner and Oracle Spreadsheet Add-In for Predictive Analytics are graphical interfaces that solve data mining problems. See "Graphical Interfaces" for a brief overview.
The end result of data mining is a model. Often this model is deployed so that its results can be embedded in an application. ODM provides facilities for deployment described in "Model Deployment".

5.2.1 Java and PL/SQL Interfaces

The Java and PL/SQL programmatic interfaces provide the facilities to do basic data preparation (binning, normalization, winsorizing, clipping, and missing values treatment). The two interfaces also provide calls that build, test, and apply the models described in Chapter 3 and Chapter 4.
The ODM Java interface and the ODM PL/SQL interface have the same capabilities. Models produced by either interface are interoperable, for example, a model can be built using one interface and applied using the other interface.

5.2.2 Automated Data Mining

The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process from data preprocessing through model building to scoring new data. This automation provides a simple and intuitive interface. The package provides an important tool that simplifies data mining for users who are not data mining experts.
DBMS_PREDICTIVE_ANALYTICS provides the following functionality:
·         EXPLAIN - Rank attributes in order of influence in explaining a target column
·         PREDICT - Predict the value of an attribute
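A minimal PL/SQL sketch of the two calls, assuming a hypothetical CUSTOMERS table with a CHURN target column and a CUST_ID case identifier; the result table names are arbitrary, and the full parameter lists are in the PL/SQL Packages and Types Reference.

DECLARE
  v_accuracy NUMBER;
BEGIN
  -- Rank the attributes that explain CHURN; results are written to EXPLAIN_OUT.
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN(
    data_table_name     => 'CUSTOMERS',
    explain_column_name => 'CHURN',
    result_table_name   => 'EXPLAIN_OUT');

  -- Predict CHURN for every case; predictions are written to PREDICT_OUT.
  DBMS_PREDICTIVE_ANALYTICS.PREDICT(
    accuracy            => v_accuracy,
    data_table_name     => 'CUSTOMERS',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'CHURN',
    result_table_name   => 'PREDICT_OUT');
END;
/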
For detailed information about DBMS_PREDICTIVE_ANALYTICS, see the Oracle Database PL/SQL Packages and Types Reference.
The Oracle Spreadsheet Add-In for Predictive Analytics provides a graphical user interface to DBMS_PREDICTIVE_ANALYTICS; the Add-In is briefly described in "Graphical Interfaces".

5.2.3 Data Mining Functions

The data mining functions are SQL functions that apply existing ODM models; they also return information about existing ODM models. The functions are as follows:
·         CLUSTER_ID: Returns the cluster identifier of the predicted cluster with the highest probability for a specified set of predictors.
·         CLUSTER_PROBABILITY: Returns a measure of the degree of confidence of membership of an input row in a cluster associated with the specified model.
·         CLUSTER_SET: Returns a varray of objects containing all possible clusters that a given row belongs to, and the probability for each returned cluster, subject to certain filtering criteria.
·         FEATURE_ID: Returns the identifier of the feature (in a feature extraction model) with the highest coefficient value.
·         FEATURE_SET: Returns a varray of objects containing all possible features and the feature values in a feature extraction model subject to certain filtering criteria.
·         FEATURE_VALUE: Returns the value of a given feature in a feature extraction model.
·         PREDICTION: Returns the best prediction for a classification or regression model given a set of predictors.
·         PREDICTION_COST: Returns the cost for a prediction made using an ODM classification model. Applies to decision tree models only if a cost matrix was specified during build.
·         PREDICTION_DETAILS: Returns an XML string containing model-specific information about the scoring by an ODM decision tree or NB model of an input row.
·         PREDICTION_PROBABILITY: Returns the probability for a prediction made using an ODM classification model.
·         PREDICTION_SET: Returns a varray of objects containing all classes in a classification prediction, the probability for each class, and, for decision tree models, the cost for each class.
The data mining functions have many benefits, the most important of which are the following:
·         The functions make deployment of models within the context of existing applications straightforward since existing SQL statements can be easily enhanced with them.
·         The functions greatly improve scoring (model apply) performance.
·         The functions enable pipelining of results involving data mining predictions; this permits, among other things, the ability to return a few results quickly to an end user.
For more information about the SQL data mining functions, see the Oracle Database SQL Reference.
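As a simple illustration, a classification model can be applied inside an ordinary query; the sketch below assumes a hypothetical model CHURN_MODEL and a CUSTOMERS table.

SELECT cust_id,
       PREDICTION(churn_model USING *)             AS predicted_churn,
       PREDICTION_PROBABILITY(churn_model USING *) AS probability
  FROM customers
 WHERE PREDICTION_PROBABILITY(churn_model, 'YES' USING *) > 0.7;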

5.2.4 Graphical Interfaces

ODM has two graphical interfaces, both of which are available as downloads from Oracle Technology Network:
·         Oracle Data Miner is a user interface to ODM that helps data analysts and application developers build advanced business intelligence applications based on ODM. ODM Java Code Generator is an extension to Oracle JDeveloper that exports models created using Oracle Data Miner to Java code.
·         Oracle Spreadsheet Add-In for Predictive Analytics enables Microsoft Excel users to mine data in Oracle tables or Excel spreadsheets using the features of the DBMS_PREDICTIVE_ANALYTICS PL/SQL package.

5.2.5 Model Deployment

It is common to build models on one system and then deploy the models to a production system. The ODM Scoring Engine, described in Chapter 7, supports common deployment scenarios.
ODM supports data mining model export and import in native format between Oracle databases or schemas to provide a way to move models.
Model export/import is supported at different levels, as follows:
·         Database export/import. When a DBA exports a full database using the expdp utility, all the existing data mining models in the database will be exported. When a DBA imports a database dump using the impdp utility, all the data mining models in the dump will be restored.
·         Schema export/import. When a user or DBA exports a schema using expdp, all the data mining models in the schema will be exported. When the user or DBA imports the schema dump using impdp, all the models in the dump will be imported.
·         Selected model export/import. Both ODM interfaces include calls that export or import specific models, for example, the PL/SQL interface includes DBMS_DATA_MINING.export_model() and DBMS_DATA_MINING.import_model().
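For example, a single model can be exported to a Data Pump dump file with a call along the following lines (hypothetical model name and directory object; the remaining parameters are described in the Oracle Database PL/SQL Packages and Types Reference):

BEGIN
  DBMS_DATA_MINING.EXPORT_MODEL(
    filename     => 'churn_model_exp',
    directory    => 'DM_DUMP',                    -- an existing directory object
    model_filter => 'name = ''CHURN_MODEL''');
END;
/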
ODM model export and model import are based on the Oracle DBMS Data Pump. When you export a model, the tables that constitute the model and the associated metadata are written to a dump file set that consists of one or more files. When you import a model, the tables and metadata are retrieved from the file and restored in the new database.
For detailed information about model export/import, see Oracle Data Mining Administrator's Guide.

6 Text Mining Using Oracle Data Mining

This chapter describes support for text mining. Oracle provides support for text mining with two products:
·         Oracle Data Mining (ODM)
·         Oracle Text
The support for text data in ODM is different from that provided by Oracle Text. Oracle Text is dedicated to text document processing. ODM allows the combination of text (unstructured) columns and non-text (categorical and numerical) columns of data as input for clustering, classification, and feature extraction.
Oracle Text is described in the Oracle Text Reference and the Oracle Text Application Developer's Guide.
Text data is the only unstructured data type supported by ODM.
Table 6-1 summarizes how DBMS_DATA_MINING, the ODM Java interface, and Oracle Text support text mining.
Oracle Data Mining Application Developer's Guide provides information that helps you develop text mining applications using the PL/SQL and Java ODM interfaces. Oracle Data Mining Administrator's Guide contains descriptions of the sample text mining programs included with ODM.
This chapter discusses the following topics:
·         What is Text Mining?
·         ODM Support for Text Mining
·         Oracle Support for Text Mining

6.1 What is Text Mining?

Text mining is conventional data mining done using "text features." Text features are usually keywords, frequencies of words, or other document-derived features. Once you derive text features, you mine them just as you would any other data.
Some of the applications for text mining include:
·         Create and manage taxonomies
·         Classify or categorize documents
·         Integrate search capabilities with classification and clustering of documents returned from a search
·         Extract topics automatically
·         Extract features for subsequent mining

6.1.1 Document Classification

Document classification, also known as document categorization, is the process of assigning documents to categories (for example, themes or subjects). A particular document may fit into two or more different categories. This type of classification can often be represented as a multi target classification problem where a supervised model is built for each category.

6.2 ODM Support for Text Mining

ODM provides infrastructure for developing data mining applications suitable for addressing a variety of business problems involving text. Among these, the following specific technologies provide key elements for addressing problems that require text mining:
·         Classification
·         Clustering
·         Feature extraction
·         Association
·         Regression
·         Anomaly Detection
The technologies that are most used in text mining are classification, clustering, and feature extraction.

6.2.1 Classification and Text Mining

A large number of document classification applications fall into one of the following:
·         Assigning multiple labels to a document. ODM does not support this case.
·         Assigning a document to one of many labels. For example, automatically assigning a mail message to a folder and spam filtering. This application requires multi-class classification.
The Support Vector Machine (SVM) algorithm provides powerful classifiers that have been used successfully in document classification applications. SVM can deal with thousands of features and is easy to train with small or large amounts of data. SVM is known to work well with text data. For more information about SVM, see Chapter 3.

6.2.2 Clustering and Text Mining

Clustering is used frequently in text mining; the main applications of clustering in text mining are:
·         Taxonomy generation
·         Topic extraction
·         Grouping the hits returned by a search engine
Clustering can also be used to group textual information with other indications from business databases to provide novel insights.
The current release of ODM supports clustering text features using both the PL/SQL and Java interfaces.
The k-Means clustering algorithm, described in "Enhanced k-Means Algorithm", supports mining text columns.

6.2.3 Feature Extraction and Text Mining

There are two kinds of text mining problems for which feature extraction is useful:
·         Extract features from actual text. Oracle Text is designed to solve this kind of problem. ODM also supports feature extraction from text. Most text mining is focused on this problem.
·         Extract semantic features or higher-level features from the basic features uncovered when features are extracted from actual text. Statistical techniques, such as singular value decomposition (SVD) and non-negative matrix factorization (NMF), are important in solving this kind of problem. Higher-order features can greatly improve the quality of information retrieval, classification, and clustering tasks.
NMF has been found to provide superior text retrieval when compared to SVD and other traditional decomposition methods. NMF takes as input a term-document matrix and generates a set of topics that represent weighted sets of co-occurring terms. The discovered topics form a basis that provides an efficient representation of the original documents. For more information about NMF, see "Algorithm for Feature Extraction".

6.2.4 Association and Text Mining

Association models can be used to uncover the semantic meaning of words. For example, suppose that the word sheep co-occurs with words like sleep, fence, chew, grass, meadow, farmer, and shear. An association model would include rules connecting sheep with these concepts. Inspection of the rules would provide context for sheep in the document collection. Such associations can improve information retrieval engines. For more information about association models, see "Association".

6.2.5 Regression and Text Mining

Regression is most often used in problems that combine text with other types of data. For more information about regression, see "Regression".

6.2.6 Anomaly Detection and Text Mining

One-Class Support Vector Machine can be used to detect or identify novel or anomalous patterns. For more information, see "Anomaly Detection".

6.3 Oracle Support for Text Mining

Table 6-1 summarizes how the ODM (both the Java and PL/SQL interfaces) and Oracle Text support text mining functions.
Table 6-1 Text Mining Comparison
Feature | ODM | Oracle Text
Association | Text data only, or text and non-text data | No support
Clustering | k-Means algorithm supports text only, or text and non-text data | k-Means algorithm supports text only
Attribute importance | No support for text data | No support
Regression | Support Vector Machine (SVM) algorithm supports text data only, or text and non-text data | No support
Classification | SVM supports text only, or text and non-text data; supports assigning documents to one of many labels | SVM and decision trees support text only; supports assigning documents to one of many labels and also assigning documents to multiple labels at the same time
One-Class SVM | One-Class SVM supports text only, or text and non-text data | No support
Feature extraction (basic features) | The Java API supports the feature extraction process that transforms a text column to a nested table; the PL/SQL API requires the use of Oracle Text procedures to perform the extraction. ODM allows the same degree of control as Oracle Text | Feature extraction is done internally; the results are not exposed
Feature extraction (higher-order features) | Non-negative matrix factorization (NMF) supports text only, or text and non-text data | No support
Record apply | No support for record apply | Supports record apply for classification
Support for text columns | Features extracted from a column of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, or LONG RAW using an appropriate transformation | Supports table columns of type CLOB, BLOB, BFILE, LONG, VARCHAR2, XMLType, CHAR, RAW, or LONG RAW

7 Oracle Data Mining Scoring Engine

This chapter describes one way to deploy the results of data mining. Some data-mining enabled applications have models that are developed on one system and then deployed to other (production) systems. Production applications often need only to apply models built elsewhere. The Oracle Data Mining Scoring Engine supports scoring data (applying models) using models created by other instances of Oracle Data Mining.
The Scoring Engine allows customers to limit the Oracle Data Mining (ODM) functionality available within their scoring applications to ensure that compute-intensive operations such as model building are not performed on production systems where data scoring is performed.
This chapter discusses the following topics:
·         ODM Scoring Engine Features
·         Moving Data Mining Models

7.1 ODM Scoring Engine Features

The ODM Scoring Engine supports operations for preparing data as required by the build process, importing a model, and applying a model to data. All transformation functionality is included in the Scoring Engine. All functionality provided in the Scoring Engine behaves exactly like the same functionality in the full system.
You cannot build models using the Scoring Engine.

7.2 ODM Scoring Engine Installation

To install the ODM Scoring Engine, select the "Data Mining Scoring Engine" option during installation; this is a custom install option for Oracle Data Mining.

7.3 Scoring in Data Mining Applications

A single model can be used to score large volumes of data, often in multiple geographically distributed application settings. Data analysis and model building might be performed by a small group of data mining experts using data from a centralized data warehouse. However, the model can be used by a much larger number of applications working with data at geographically dispersed sites using local data. Local data may consist of millions of records representing customers; therefore, it can make sense to move the model to where the data is.
In real-time applications such as call centers, models are often built in one environment and used in another. There may be one machine dedicated to model building, using large volumes of data to produce models on a daily basis. Several other machines may be dedicated to real-time scoring, receiving new models to support, for example, the call center application. Call center representatives collect information from callers; the collected information is then used to obtain predictions or recommendations for that particular caller in real time. Scoring in real time often requires that the model is moved to where the data is.

7.4 Moving Data Mining Models

Oracle Data Mining supports data movement from one schema or database instance to another with native export and import.
Native model export and import are based on Oracle Data Pump technology.
Native export is supported at three different levels, as follows:
·         When a full database is exported using the Oracle Data Pump Export Utility (expdp), all data mining models in the database are exported.
·         When a schema is exported using the Oracle Data Pump Export Utility (expdp), all data mining models in the schema are exported.
·         An ODM user can export one or more specific models in a schema using the PL/SQL DBMS_DATA_MINING.export_model procedure or the Java task javax.datamining.task.ExportTask.
Native import is also supported in all scenarios. Using the dump file set produced by the Oracle Data Pump Export Utility (expdp), an ODM user can run the Oracle Data Pump Import Utility (impdp) to import all data mining models contained in the dump file set.
ODM users can import a specific model from the dump file set using the PL/SQL DBMS_DATA_MINING.import_model procedure or the Java javax.datamining.task.ImportTask.
You can also perform native export and import using the ODM Java interface; for an example, see the dmexpimpdemo.java sample program.
For more information about native export and import, see the Oracle Data Mining Administrator's Guide, the DBMS_DATA_MINING chapter in the Oracle Database PL/SQL Packages and Types Reference, and the Oracle Data Mining Application Developer's Guide.

7.5 Using the Oracle Data Mining Scoring Engine

Suppose that the application builds a model on one system named BUILDSYS and applies the model on a different system named SCORESYS.
BUILDSYS must have ODM installed. ODM supports all data mining activities (building models, testing models, applying models, etc.). SCORESYS has the ODM Scoring Engine installed. It will not be possible to build models on SCORESYS, but it will be possible to apply the model. Note that SCORESYS could have a full ODM product installation, but model building may interfere with scoring performance.
The following processing takes place:
1.      Build models on BUILDSYS. Select the appropriate model to deploy.
2.      Export the model using the ODM PL/SQL or Java interface.
3.      Copy the dump file generated by model export to SCORESYS.
4.      Import the model on SCORESYS using the ODM PL/SQL or Java interface.
5.      Apply the model to data on SCORESYS.
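Steps 4 and 5 might look as follows on SCORESYS, with hypothetical model, file, and directory names; the dump file name and the exact DBMS_DATA_MINING.import_model parameters depend on how the export was performed, so treat this as a sketch.

BEGIN
  -- Step 4: import the model from the dump file copied to SCORESYS.
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename     => 'churn_model_exp01.dmp',     -- hypothetical dump file name
    directory    => 'DM_DUMP',
    model_filter => 'name = ''CHURN_MODEL''');
END;
/

-- Step 5: apply the imported model with the SQL scoring functions.
SELECT cust_id,
       PREDICTION(churn_model USING *) AS predicted_churn
  FROM customers;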

8 Sequence Similarity Search and Alignment (BLAST)

This chapter describes Oracle Data Mining support for certain problems in the life sciences. In addition to data mining functions that produce supervised and unsupervised models, ODM supports the sequence similarity search and alignment algorithm Basic Local Alignment Search Tool (BLAST). In life sciences, vast quantities of data including nucleotide and amino acid sequences are stored, typically in a database. These sequence data help biologists determine the chemical structure, biological function, and evolutionary history of organisms. A key feature of managing the exponential growth in sequence data sources is the availability of fast, sensitive, and statistically rigorous techniques for detecting similarities between these sequences.
As the amount of nucleotide and amino acid sequence data continues to grow, the data becomes increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such sequence similarities.
This chapter discusses the following topics:
·         BLAST in the Oracle Database
For detailed information about Oracle's implementation of BLAST, see Oracle Data Mining Application Developer's Guide.

8.1 Bioinformatics Sequence Search and Alignment

Sequence alignment is one of the most common bioinformatics tasks. It is present in almost any research and development activity across the many industries in the area of life sciences including academia, biotech, services, software, pharmaceutical companies, and hospitals.
The National Center for Biotechnology Information (NCBI) developed one of the commonly used versions of the Basic Local Alignment Search Tool (BLAST).
Of all the sequence alignment algorithms, the one that is most widely used is BLAST. It is typically used to compare one query nucleotide or protein sequence against a database of sequences, and uncover similarities and sequence matches. Its success and popularity comes from its combination of speed, sensitivity, and statistical assessment of the results.
BLAST is a heuristic method to find the high-scoring locally optimal alignments between a query sequence and a database. The BLAST algorithm and family of programs rely on the statistics of gapped and un-gapped sequence alignments. The statistics allow the probability of obtaining an alignment with a particular score to be estimated. BLAST is unlikely to be as sensitive for all protein searches as a full dynamic programming algorithm. However, the underlying statistics provide a direct estimate of the significance of any match found.
The inclusion of BLAST in ODM positions the Oracle database as a platform for bioinformatics.
For more information about BLAST, see the NCBI BLAST Home Page at http://www.ncbi.nlm.nih.gov/BLAST.

8.2 BLAST in the Oracle Database

Implementing BLAST in the database provides the following benefits:
·         You can include BLAST in complex queries, thereby enabling complex analytical pipelines that include BLAST searches.
·         You can subselect portions of the database using SQL, thereby restricting searches.
·         Since sequence data is already stored in the database, it is not necessary to export the sequence data and pre-process them to create BLAST data sets and then import the results back into the database.

8.3 Oracle Data Mining Sequence Search and Alignment Capabilities

Sequence search and alignment, with capabilities similar to those of NCBI BLAST 2.0, has been implemented in the database using table functions. This implementation enables users to perform queries against data that is held directly inside an Oracle database. As the algorithms are implemented as table functions, parallel computation is intrinsically supported.
The five core variants of BLAST have been implemented:
·         BLASTN compares a nucleotide query sequence against a nucleotide database.
·         BLASTP compares a protein query sequence against a protein sequence database.
·         BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
·         TBLASTN compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
·         TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
The BLAST table functions are implemented in the database, and can be invoked by SQL. Using SQL, it is possible to pre-process the sequences as well as perform any required post-processing. This additional processing capability means it is possible to combine BLAST searches with queries that involve images, date functions, literature search, etc. Using these complex queries makes it possible to perform BLAST searches on a required subset of data, potentially resulting in highly performant queries. This functionality is expected to provide considerable value.
The performance of BLAST searches can be improved if the data set is transformed into a compressed binary format and used in the searches. BLASTN_COMPRESS() compresses nucleotide sequence data. The BLAST queries can use this compressed format for searches.
BLAST queries can be invoked directly using the SQL interface or through an application. The query below shows an example of a SQL-invoked BLAST search where a protein sequence is compared with the protein database SwissProt, and sequences are filtered so that only human sequences that were deposited after 1 January 1990 are searched against. The column of numbers at the end of the query reflects the parameters chosen.
select t_seq_id, alignment_length, q_seq_start, q_seq_end,
       q_frame, t_seq_start, t_seq_end, t_frame, score, expect 
  from TABLE( 
       BLASTP_ALIGN ( 
         (select sequence from query_db), 
         CURSOR(SELECT seq_id, seq_data 
                FROM swissprot 
                WHERE organism = 'Homo sapiens (Human)' AND 
                      creation_date > '01-Jan-90'), 
         1, 
         -1, 
         0, 
         0, 
         'BLOSUM62', 
         10, 
         0, 
         0, 
         0, 
         0, 
         0) 
       ); 
 
The results of a SQL query can be displayed either through an application or a SQL*Plus interface. When the SQL*Plus interface is used, the user can decide how the results will be displayed. The following shows an example of the format of the output that could be displayed by the SQL query shown above.
T_SEQ_ID  ALIGNMENT_LENGTH  Q_SEQ_START  Q_SEQ_END  Q_FRAME  T_SEQ_START  T_SEQ_END  T_FRAME  SCORE  EXPECT
--------  ----------------  -----------  ---------  -------  -----------  ---------  -------  -----  ----------
P31946          50               0           50        0          13          63        0      205   5.1694E-18
Q04917          50               0           50        0          12          62        0      198   3.3507E-17
P31947          50               0           50        0          12          62        0      169   7.7247E-14
P27348          50               0           50        0          12          62        0      198   3.3507E-17
P58107          21              30           51        0         792         813        0       94   6.34857645
The first row of information represents some of the attributes returned by the query: the target sequence ID, the length of the alignment, the position where the alignment starts on the query sequence, the position where the alignment ends on the query sequence, which open reading frame was used for the query sequence, among others.
The next five rows represent sequence alignments that were returned; for example, the protein with the highest alignment to query sequence has the accession number "P31946", the alignment length was 50 amino acids, the alignment started at the first base of the amino acid query, and ended with the 50th base of the amino acid query.
