Thursday, May 7, 2009

DMDW FAQs with solutions for JNTU B.Tech and MCA students: Unit 1

1. Define data mining. What are the fundamentals of data mining?

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Fundamentals of Data mining:
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
Massive data collection
Powerful multiprocessor computers
Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least 10 years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user’s point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.
2. Define KDD. What are the steps in KDD?
Knowledge discovery is defined as "the non-trivial extraction of implicit, unknown, and potentially useful information from data". A clear distinction is drawn between data mining and knowledge discovery: the knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information. This information is not typically retrievable by standard techniques but is uncovered through the use of AI techniques.
Although there are many approaches to KDD, six common and essential elements qualify each as a knowledge discovery technique. The following are basic features that all KDD techniques share:
All approaches deal with large amounts of data
Efficiency is required due to volume of data
Accuracy is an essential element
All require the use of a high-level language
All approaches use some form of automated learning
All produce some interesting results
Large amounts of data are required to provide sufficient information to derive additional knowledge. Since large amounts of data are required, processing efficiency is essential. Accuracy is required to assure that discovered knowledge is valid. The results should be presented in a manner that is understandable by humans. One of the major premises of KDD is that the knowledge is discovered using intelligent learning techniques that sift through the data in an automated process. For this technique to be considered useful in terms of knowledge discovery, the discovered knowledge must be interesting; that is, it must have potential value to the user.
KDD provides the capability to discover new and meaningful information from existing data. The amount of data that requires processing and analysis in a large database quickly exceeds human capabilities, and the difficulty of accurately transforming raw data into knowledge surpasses the limits of traditional databases. Therefore, the full utilization of stored data depends on the use of knowledge discovery techniques.
The usefulness of future applications of KDD is far-reaching. KDD may be used as a means of information retrieval, in the same manner that intelligent agents perform information retrieval on the web. New patterns or trends in data may be discovered using these techniques. KDD may also be used as a basis for the intelligent interfaces of tomorrow, by adding a knowledge discovery component to a database engine or by integrating KDD with spreadsheets and visualizations.
3. Define data warehouse. Explain the multidimensional model?

Data warehouse:

A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.

The multi-dimensional model is an integral aspect of On-Line Analytical Processing (OLAP).
Because OLAP is online, it provides information quickly; iterative queries are often posed during interactive sessions.

Because of the analytical nature of OLAP, the queries are often complex. The multi-dimensional model is used to answer this kind of complex query, and its importance lies in the simplicity it brings.

This simplicity helps users understand the databases and enables software to navigate them efficiently.

Multi-dimensional data models are made up of logical cubes, measures, and dimensions. Within the models you can also find hierarchies, levels, and attributes.

The straightforwardness of the model is essential because it identifies objects that represent real-world entities.

Analysts know what measures they want to see, which dimensions and attributes make the data important, and how the dimensions of their business are organized into levels and hierarchies.
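As a concrete illustration, here is a minimal SQL sketch of a multi-dimensional query, assuming a hypothetical sales table with region and product_line dimension columns and an amount measure:

-- Total sales viewed along two dimensions
SELECT region, product_line, SUM(amount) AS total_sales
FROM sales
GROUP BY region, product_line;

Rolling the same measure up or down a hierarchy (product_line to product, region to city) produces the different views of the business that the model is designed to support.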

4. Define data mining. What are the major issues of data mining?

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Major Issues
One of the key issues raised by data mining technology is not a business or technological one, but a social one. It is the issue of individual privacy. Data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.
Another issue is that of data integrity. Clearly, data analysis can only be as good as the data that is being analyzed. A key implementation challenge is integrating conflicting or redundant data from different sources. For example, a bank may maintain credit card accounts on several different databases. The addresses (or even the names) of a single cardholder may be different in each. Software must translate data from one system to another and select the address most recently entered.
A hotly debated technical issue is whether it is better to set up a relational database structure or a multidimensional one. In a relational structure, data is stored in tables, permitting ad hoc queries. In a multidimensional structure, on the other hand, sets of cubes are arranged in arrays, with subsets created according to category. While multidimensional structures facilitate multidimensional data mining, relational structures thus far have performed better in client/server environments. And, with the explosion of the Internet, the world is becoming one big client/server environment.
Finally, there is the issue of cost. While system hardware costs have dropped dramatically within the past five years, data mining and data warehousing tend to be self-reinforcing. The more powerful the data mining queries, the greater the utility of the information being gleaned from the data, and the greater the pressure to increase the amount of data being collected and maintained, which increases the pressure for faster, more powerful data mining queries. This increases pressure for larger, faster systems, which are more expensive.
5. Explain the construction and implementation of the multi-dimensional model.

The multi-dimensional model is an integral aspect of On-Line Analytical Processing (OLAP).
Because OLAP is online, it provides information quickly; iterative queries are often posed during interactive sessions.

Because of the analytical nature of OLAP, the queries are often complex. The multi-dimensional model is used to answer this kind of complex query, and its importance lies in the simplicity it brings.

This simplicity helps users understand the databases and enables software to navigate them efficiently.

Multi-dimensional data models are made up of logical cubes, measures, and dimensions. Within the models you can also find hierarchies, levels, and attributes.

The straightforwardness of the model is essential because it identifies objects that represent real-world entities.

Analysts know what measures they want to see, which dimensions and attributes make the data important, and how the dimensions of their business are organized into levels and hierarchies.
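Since this question asks about construction, here is a hedged DDL sketch of how such a model is typically implemented relationally; all table and column names are illustrative:

-- A dimension table with two levels of a hierarchy (product -> product_line)
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(50),
    product_line VARCHAR(30)
);

-- A time dimension with day/month/year levels
CREATE TABLE dim_time (
    time_id    INTEGER PRIMARY KEY,
    sale_day   DATE,
    sale_month INTEGER,
    sale_year  INTEGER
);

-- The logical cube: one row per (product, time) cell; amount is the measure
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product,
    time_id    INTEGER REFERENCES dim_time,
    amount     DECIMAL(10,2),
    PRIMARY KEY (product_id, time_id)
);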

6. Explain the Architecture of Data mining with a neat diagram?
Architecture for Data Mining
To be applied most effectively, these advanced techniques must be fully integrated with a data warehouse as well as flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems: Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business – summarizing by product line, region, and other key perspectives of their business. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.
This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users’ business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.

7. Explain the architecture of Data warehouse with a neat diagram.
Data Warehouse Architecture (with a Staging Area)
You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.
Figure 1-3 Architecture of a Data Warehouse with a Staging Area
Although the architecture in the above figure is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. The above figure illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
8. Write a short note on data mart and metadata, with an example?
A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization. Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of organizational data marts.
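A minimal sketch of deriving a data mart from a warehouse, assuming hypothetical table names (CREATE TABLE ... AS SELECT syntax varies by dialect):

-- A sales mart for one line of business: just the slice that group needs
CREATE TABLE sales_mart AS
SELECT store_id, product_id, sale_date, units_sold
FROM fact_sales
WHERE sale_date >= DATE '2008-01-01';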
Metadata (meta data, or sometimes metainformation) is "data about other data", of any sort in any media. An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema. In data processing, metadata is definitional data that provides information about or documentation of other data managed within an application or environment. The term should be used with caution as all data is about something, and is therefore "metadata" in a sense, and vice versa.
For example, metadata would document data about data elements or attributes (name, size, data type, etc.), data about records or data structures (length, fields, columns, etc.), and data about data (where it is located, how it is associated, ownership, etc.). Metadata may include descriptive information about the context, quality and condition, or characteristics of the data. It may be recorded with high or low granularity.
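Relational databases expose metadata about their own tables through catalog views. A sketch against the standard information_schema (supported by most, but not all, systems):

-- Data about data: the name, type, and size of every column in a table
SELECT column_name, data_type, character_maximum_length
FROM information_schema.columns
WHERE table_name = 'sales_order';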
9. Differentiate OLAP and OLTP?
OLTP                                  OLAP
Current data                          Current and historical data
Short database transactions           Long database transactions
Online update/insert/delete           Batch update/insert/delete
Normalization is promoted             Denormalization is promoted
High volume transactions              Low volume transactions
Transaction recovery is necessary     Transaction recovery is not necessary
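The contrast is easiest to see in the kind of statement each system spends its time on; both examples below use illustrative table names:

-- OLTP: a short transaction that updates one current record
UPDATE account
SET balance = balance - 100
WHERE account_id = 42;

-- OLAP: a long, read-only query that aggregates years of history
SELECT sale_year, region, SUM(amount) AS total_sales
FROM sales_history
GROUP BY sale_year, region;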
10. Explain the operations of OLAP?
Think of OLAP operations (ROLLUP, CUBE, and GROUPING SETS) as follows:
· A list of prefixes is constructed for the query. A prefix is a subset of the items in the group-by-list.
· A prefix is constructed by excluding one or more of the rightmost items from those in the query's group-by-list. The remaining columns are called the prefix columns.
Take the following ROLLUP query as an example:

SELECT year(order_date) Year, quarter(order_date) Quarter, count(*) Orders
FROM sales_order
GROUP BY ROLLUP (Year, Quarter)
ORDER BY Year, Quarter
In the example above, the group-by-list in GROUP BY ROLLUP (Year, Quarter) contains two items, Year and Quarter. The prefixes are therefore (Year, Quarter), (Year), and the empty prefix, which produces the grand-total row.
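The same result can be produced with explicit GROUPING SETS, one set per prefix; the empty set () yields the grand-total row. This sketch keeps the year() and quarter() functions from the example above (standard SQL would use EXTRACT):

SELECT year(order_date) AS Year, quarter(order_date) AS Quarter, count(*) AS Orders
FROM sales_order
GROUP BY GROUPING SETS (
    (year(order_date), quarter(order_date)),  -- the full group-by-list
    (year(order_date)),                       -- prefix: Year only
    ()                                        -- empty prefix: grand total
);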

11. Explain fact table and dimension table?
In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the centre of a star schema, surrounded by dimension tables.
Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
Fact Table: The centralized table in a star schema is called the FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. Fact tables store different types of measures, such as additive, non-additive, and semi-additive measures.
Dimension Table
A dimension table is one of a set of companion tables to a fact table.
The fact table contains business facts or measures and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
The dimension tables contain attributes (or fields) used to constrain and group data when performing data warehousing queries.
Over time, the attributes of a given row in a dimension table may change. For example, the shipping address for a company may change. Kimball refers to this phenomenon as slowly changing dimensions. Strategies for dealing with this kind of change are divided into three categories (a sketch of the Type Two strategy follows the list):
Type One - Simply overwrite the old value(s).
Type Two - Add a new row containing the new value(s), and distinguish between the rows using tuple-versioning techniques.
Type Three - Add a new attribute to the existing row.
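A hedged sketch of the Type Two strategy for a changed shipping address, assuming illustrative columns (a surrogate customer_key plus effective/expiry dates to version the rows):

-- Close off the current version of the row...
UPDATE dim_customer
SET expiry_date = CURRENT_DATE
WHERE customer_id = 1007 AND expiry_date IS NULL;

-- ...then insert a new version carrying the new address
INSERT INTO dim_customer (customer_key, customer_id, address, effective_date, expiry_date)
VALUES (98231, 1007, '12 New Street', CURRENT_DATE, NULL);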
12. Define Database server?
A computer in a LAN dedicated to database storage and retrieval. The database server is a key component in a client/server environment. It holds the database management system (DBMS) and the databases. Upon requests from the client machines, it searches the database for selected records and passes them back over the network. A database server and file server may be one and the same, because a file server often provides database services. However, the term implies that the system is dedicated to database use only and is not a central storage facility for applications and files.
13. What is a graphical user interface? Explain the types of graphical user interfaces.
A graphical user interface (GUI) is a type of user interface which allows people to interact with electronic devices such as computers; hand-held devices such as MP3 players, portable media players or gaming devices; household appliances and office equipment with images rather than text commands. A GUI offers graphical icons and visual indicators, as opposed to text-based interfaces, typed command labels or text navigation, to fully represent the information and actions available to a user. The actions are usually performed through direct manipulation of the graphical elements.
14. Explain different types of databases.
Types of databases:
i) Relational Databases
A relational database is a database that groups data using common attributes found in the data set. The resulting "clumps" of organized data are much easier for people to understand. For example, a data set containing all the real estate transactions in a town can be grouped by the year the transaction occurred; or it can be grouped by the sale price of the transaction; or it can be grouped by the buyer's last name; and so on.
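The real-estate example translates directly into SQL; the table and column names below are illustrative:

-- Group all transactions by the year in which they occurred
SELECT sale_year, COUNT(*) AS transactions, AVG(sale_price) AS average_price
FROM real_estate_transactions
GROUP BY sale_year;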
ii) Data warehouse
Data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
This definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.
iii) Operational databases
Operational databases are very important to a business. These databases allow a business to enter, gather, and retrieve specific company information. An operational database is also known by another name: production database. Because of the different names, users can misunderstand what the database is supposed to be used for within a business. For instance, "transaction database" could suggest that the stored information focuses on financial data even though it may not. Operational databases can store different types of information, such as training status, personal employee information, and previous proposal information. Storing information in a centralized area can reduce retrieval time for users, which matters when information is needed quickly. An important feature of an operational database is the ability to share information across the company. Another consideration is capacity: the type of operational database being used determines how much information it can hold.
iv) Object – Relational databases
An object-relational database (ORD) or object-relational database management system (ORDBMS) is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, it supports extension of the data model with custom data-types and methods.
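A sketch of the object-relational extensions using PostgreSQL-style syntax; the type and table are illustrative:

-- A custom composite data type...
CREATE TYPE address AS (
    street VARCHAR(80),
    city   VARCHAR(40),
    zip    VARCHAR(10)
);

-- ...used directly as a column type
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    home_address address
);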
v) Spatial databases
A spatial database is a database that is optimized to store and query data related to objects in space, including points, lines and polygons. While typical databases can understand various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types. These are typically called geometry or feature. The Open Geospatial Consortium created the Simple Features specification and sets standards for adding spatial functionality to database systems.
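A sketch of a spatial query using PostGIS-style OGC Simple Features functions; the table names are illustrative:

-- Which stores fall inside the 'Downtown' territory polygon?
SELECT s.store_name
FROM stores s
JOIN territories t ON ST_Contains(t.boundary, s.location)
WHERE t.territory_name = 'Downtown';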

15. Differentiate heterogeneous and legacy databases.
A Heterogeneous Database System is an automated (or semi-automated) system for the integration of heterogeneous, disparate database management systems to present a user with a single, unified query interface.
Heterogeneous database systems (HDBS) are computational models and software implementations that provide heterogeneous database integration [1,2].
Data across constituent databases may be related but different. Perhaps a database system must be able to integrate genomic and proteomic data. They are related (a gene may have several protein products) but the data is different (nucleotide sequences versus amino acid sequences, or hydrophilic/hydrophobic amino acid sequences versus positively/negatively charged amino acids). There may be many ways of looking at semantically similar, but distinct, datasets.
The system may also be required to present 'new' knowledge to the user. Relationships may be inferred between data according to rules specified in domain ontologies.
Legacy data is data that you already have and use. Most often, this takes the form of records in an existing database on a system in current use. At Mindwrap, the questions we are most frequently asked concern how to use these legacy records, how to "image enable" an existing database, or how Optix can interact with an existing system. Often, the existing database forms an important part of daily work and will remain in place after Optix is installed.
Optix can help you leverage your legacy data in a variety of ways. Because Optix stores all of its management tables, workflow tables, and user-level attribute tables in an underlying database such as Oracle™, Informix™, or Sybase™, it is relatively easy to copy a legacy database into Optix by offloading records and onloading them into the Optix database. Using our index screen design tool, you can generate an indexing and query screen to match any existing record layout.
16. Explain classification of data mining systems.
Data mining draws on database systems, machine learning, visualization, and information science. Owing to the diversity of disciplines contributing to data mining, research is expected to generate a large variety of data mining systems; a clear classification of data mining systems helps users distinguish between such systems and identify those that best match their needs. Data mining systems may be classified as follows:
1. Classification according to the kind of databases mined.
2. Classification according to the kind of knowledge mined.
3. Classification according to the kind of techniques utilized.
4. Classification according to applications adapted.
17. Define Schema. Illustrate star schema with an example.
A schema is the overall logical structure of a database, described in a formal language supported by the database management system: the definitions of its tables, columns, relationships, and integrity constraints.
The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few fact tables (possibly only one, justifying the name) referencing any number of dimension tables. The star schema is considered an important special case of the snowflake schema.
Consider a database of sales, perhaps from a store chain, classified by date, store and product. A star schema version of this database is described below.
Fact_Sales is the fact table and there are three dimension tables Dim_Date, Dim_Store and Dim_Product.
Each dimension table has a primary key on its Id column, relating to one of the columns of the Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key Units_Sold column of the fact table in this example represents a measure or metric that can be used in calculations and analysis. The non-primary key columns of the dimension tables represent additional attributes of the dimensions (such as the Year of the Dim_Date dimension).
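A DDL sketch of this schema; the column types and non-key attributes are assumptions:

CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Full_Date DATE, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Store_Name VARCHAR(40));
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Product_Name VARCHAR(40));

-- The fact table: three foreign keys form the composite primary key;
-- Units_Sold is the measure
CREATE TABLE Fact_Sales (
    Date_Id    INTEGER REFERENCES Dim_Date,
    Store_Id   INTEGER REFERENCES Dim_Store,
    Product_Id INTEGER REFERENCES Dim_Product,
    Units_Sold INTEGER,
    PRIMARY KEY (Date_Id, Store_Id, Product_Id)
);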
18. Explain fact table. Illustrate star schema with an example?
Fact Table: The centralized table in a star schema is called the FACT table. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. Fact tables store different types of measures, such as additive, non-additive, and semi-additive measures.
A star schema consists of fact tables and dimension tables. Fact tables contain the quantitative or factual data about a business--the information being queried. This information is often numerical, additive measurements and can consist of many columns and millions or billions of rows. Dimension tables are usually smaller and hold descriptive data that reflects the dimensions, or attributes, of a business. SQL queries then use joins between fact and dimension tables and constraints on the data to return selected information.
Simple star schemas
Any table that references or is referenced by another table must have a primary key, which is a column or group of columns whose contents uniquely identify each row. In a simple star schema, the primary key for the fact table consists of one or more foreign keys. A foreign key is a column or group of columns in one table whose values are defined by the primary key in another table. In IBM Red Brick Warehouse, you can use these foreign keys and the primary keys in the tables that they reference to build STAR indexes, which improve data retrieval performance.
When a database is created, the SQL statements used to create the tables must designate the columns that are to form the primary and foreign keys.
The following figure illustrates the relationship of the fact and dimension tables within a simple star schema with a single fact table and three dimension tables. The fact table has a primary key composed of three foreign keys, Key1, Key2, and Key3, each of which is the primary key in a dimension table. Nonkey columns in a fact table are referred to as data columns. In a dimension table, they are referred to as attributes.
Figure 11. Simple Star Schema
In the figures used to illustrate schemas:
· The items listed within the box under each table name indicate columns in the table.
· Primary key columns are labeled in bold type.
· Foreign key columns are labeled in italic type.
· Columns that are part of the primary key and are also foreign keys are labeled in bold italic type.
· Foreign key relationships are indicated by lines connecting tables.
Although the primary key value must be unique in each row of a dimension table, that value can occur multiple times in the foreign key in the fact table--a many-to-one relationship.
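A typical star join over the simple schema in Figure 11, constraining one dimension and aggregating a fact-table data column; the attribute names are illustrative:

SELECT d2.product_name, SUM(f.sales_amount) AS total_sales
FROM fact f
JOIN dimension1 d1 ON f.key1 = d1.key1   -- foreign key to primary key
JOIN dimension2 d2 ON f.key2 = d2.key2
WHERE d1.region = 'East'                 -- constraint on a dimension attribute
GROUP BY d2.product_name;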

19. Define concept hierarchy. Explain the types of concept hierarchies.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher level, more general concepts.

Types of concept hierarchy:
1. Schema hierarchy: A schema hierarchy is a total or partial order among attributes in the database schema. Schema hierarchies may formally express existing semantic relationships between attributes. Typically, a schema hierarchy specifies a data warehouse dimension.

2. Set-grouping hierarchies: A set-grouping hierarchy organizes values for a given attribute or dimension into groups of constants or range values. A total or partial order can be defined among groups. Set-grouping hierarchies can be used to refine or enrich schema-defined hierarchies, when the two types of hierarchies are combined. They are typically used for defining small sets of object relationships.
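A set-grouping hierarchy can be sketched directly in SQL. The classic example maps a numeric age attribute into range groups (the labels and boundaries are illustrative):

-- Map low-level values into higher-level groups, then aggregate by group
SELECT age_group, COUNT(*) AS customers
FROM (
    SELECT CASE
               WHEN age < 30 THEN 'young'
               WHEN age < 60 THEN 'middle_aged'
               ELSE 'senior'
           END AS age_group
    FROM customer
) AS grouped
GROUP BY age_group;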

20. Differentiate ROLAP and MOLAP?
MOLAP (multidimensional OLAP) tools utilize a pre-calculated data set, commonly referred to as a data cube, that contains all the possible answers to a given range of questions. MOLAP tools feature very fast response, and the ability to quickly write back data into the data set (budgeting and forecasting are common applications). Primary downsides of MOLAP tools are limited scalability (the cubes get very big, very fast when you start to add dimensions and more detailed data), inability to contain detailed data (you are forced to use summary data unless your data set is very small), and load time of the cubes. The most common examples of MOLAP tools are Hyperion (Arbor) Essbase and Oracle (IRI) Express. MOLAP tools are best used for users who have "bounded" problem sets (they want to ask the same range of questions every day/week/month on an updated cube, e.g. finance).
ROLAP (relational OLAP) tools do not use pre-calculated data cubes. Instead, they intercept the query and pose the question to the standard relational database and its tables in order to bring back the data required to answer the question. ROLAP tools feature the ability to ask any question (you are not limited to the contents of a cube) and the ability to drill down to the lowest level of detail in the database. Primary downsides of ROLAP tools are slow response and some limitations on scalability (depending on the technology architecture that is utilized). The most common examples of ROLAP tools are MicroStrategy and Sterling (Information Advantage). ROLAP tools are best used for users who have "unbounded" problem sets (they don't know from day to day what they will want to ask; e.g., marketing). It is very important to pay close attention to the underlying architecture of ROLAP tools, as some tools are very "scalability challenged."
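The practical difference shows up in what gets executed. For a question such as "sales by region by quarter", a MOLAP tool reads pre-calculated cells from the cube, while a ROLAP tool generates SQL like the following against the base tables at query time (the table names are illustrative):

SELECT s.region, t.quarter, SUM(f.amount) AS sales
FROM fact_sales f
JOIN dim_store s ON f.store_id = s.store_id
JOIN dim_time  t ON f.time_id  = t.time_id
GROUP BY s.region, t.quarter;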
