Wednesday, May 11, 2011

Multidimensional data analysis with SQL

Using advanced SQL features, you can even perform multidimensional data analysis as well, enabling you to answer complex analytical questions.
This can be best understood by an example on group operations, aggregating data over multiple rows stored in several tables. Suppose you collect information about sales, storing the data in related tables within a single database schema. At the end of each quarter, you need to know the summarized sales figures for every region, for each month within that quarter.
To start with, let's create a new database schema, and the tables within it that are needed for this example. You can perform all these tasks from SQL*Plus.

CREATE USER usr IDENTIFIED BY usr;
GRANT connect, resource TO usr;
CONN usr/usr

CREATE TABLE salespersons (
emp_id VARCHAR2(10) PRIMARY KEY,
emp_name VARCHAR2(40)
)

CREATE TABLE regions (
reg_id VARCHAR2(2) PRIMARY KEY,
reg_name VARCHAR2(20)
)

CREATE TABLE orders(
ordno NUMBER PRIMARY KEY,
empid VARCHAR2(10) REFERENCES salespersons(emp_id),
regid VARCHAR2(2) REFERENCES regions(reg_id),
orddate DATE,
total NUMBER(10,2)
)

Next, you have to populate these tables with some data, so that you can perform queries against them. As the orders table refers to the salespersons and regions tables, you first have to populate these two tables with data. You might use the following statements for this:

INSERT INTO salespersons VALUES ('Abuzar', 'Abuzar Bukhari');
INSERT INTO salespersons VALUES ('Usman', 'Usman Bukhari');
INSERT INTO regions VALUES ('PK', 'Pakistan');
INSERT INTO regions VALUES ('SA', 'Saudi Arabia');

Now that you have filled up the salespersons and regions tables, you can move on and populate the orders table:

INSERT INTO orders VALUES
(1001, 'Abuzar', 'PK', '10-JAN-2010', 1450.00);
INSERT INTO orders VALUES
(1002, 'Abuzar', 'PK', '15-JAN-2010', 2310.00);
INSERT INTO orders VALUES
(1003, 'Usman', 'SA', '20-JAN-2010', 1480.00);
INSERT INTO orders VALUES
(1004, 'Abuzar', 'PK', '19-FEB-2010', 3700.00);
INSERT INTO orders VALUES
(1005, 'Usman', 'SA', '24-FEB-2010', 1850.00);
INSERT INTO orders VALUES
(1006, 'Usman', 'SA', '04-MAR-2010', 1770.00);
INSERT INTO orders VALUES
(1007, 'Usman', 'SA', '05-MAR-2010', 1210.00);
INSERT INTO orders VALUES
(1008, 'Abuzar', 'PK', '05-MAR-2010', 10420.00);
Commit;

You now have all the parts required to proceed with the example, and can issue a query that summarizes sales figures for every region for each month. Here is what this query might look like:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY ROLLUP(EXTRACT(MONTH FROM (o.orddate)), r.reg_name);

In the above query, you summarize the data representing the sales figures, generating subtotals for every region for each month, and then totals for each month with the help of the GROUP BY ... ROLLUP clause. The query results should look like this:

REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan             January         3760
Saudi Arabia         January         1480
                     January         5240
Pakistan             February        3700
Saudi Arabia         February        1850
                     February        5550
Pakistan             March          10420
Saudi Arabia         March           2980
                     March          13400
                                    24190
As you can see, the query output contains not only monthly sales per region and month totals for each month, but also the grand total, which is nothing but the sum of the month totals. You might exclude this grand total from the report if you are using a partial ROLLUP as shown next:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY EXTRACT(MONTH FROM (o.orddate)), ROLLUP(r.reg_name);

As the month field is excluded from the ROLLUP here, the month totals will not be summarized:

REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan             January         3760
Saudi Arabia         January         1480
                     January         5240
Pakistan             February        3700
Saudi Arabia         February        1850
                     February        5550
Pakistan             March          10420
Saudi Arabia         March           2980
                     March          13400
                    
Changing the order in which the parameters of ROLLUP appear will also affect the report. In our example, if you move the r. reg_name parameter to the first position, the report will generate totals for each region rather than for each month. Here is the query:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY ROLLUP(r.reg_name, EXTRACT(MONTH FROM (o.orddate)));   

Here is its output:
REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan             January         3760
Pakistan             February        3700
Pakistan             March          10420
Pakistan                            17880
Saudi Arabia         January         1480
Saudi Arabia         February        1850
Saudi Arabia         March           2980
Saudi Arabia                         6310
                                    24190
As you can see, in the above report there are no longer any month totals. Instead, it provides you with the totals for each region.

Cubing

So by now you are probably wondering if it's possible to have the totals for both each month and each region, thus having all possible combinations of the grouping columns within a single report. While there is more than one way to achieve this, using the CUBE extension of the GROUP BY clause is the most elegant and effective one.
The following figure gives a graphical depiction of how CUBE differs from ROLLUP when it comes to generating summary information:


Switching to SQL, you might write the following query using the CUBE function in the GROUP BY clause:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY CUBE(EXTRACT(MONTH FROM (o.orddate)), r.reg_name);                                                    

The previous query should generate the following report:
REGION               MONTH          SALES
-------------------- --------- ----------
                                    24190
Pakistan                            17880
Saudi Arabia                         6310
                     January         5240
Pakistan             January         3760
Saudi Arabia         January         1480
                     February        5550
Pakistan             February        3700
Saudi Arabia         February        1850
                     March          13400
Pakistan             March          10420
Saudi Arabia         March           2980


As you can see, you now have all possible subtotals and totals combinations presented in the report. This is why changing the order of columns in CUBE won't affect the report, unlike with ROLLUP. SO, the following query should actually give you the same results as above:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY CUBE(r.reg_name, EXTRACT(MONTH FROM (o.orddate)));

However, the order of rows in the report will be a little different:
REGION               MONTH          SALES
-------------------- --------- ----------
                                    24190
                     January         5240
                     February        5550
                     March          13400
Pakistan                            17880
Pakistan             January         3760
Pakistan             February        3700
Pakistan             March          10420
Saudi Arabia                         6310
Saudi Arabia         January         1480
Saudi Arabia         February        1850
Saudi Arabia         March           2980

Just as there is a partial ROLLUP, there is also a partial CUBE. For example, you might issue the following query:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY r.reg_name, CUBE(EXTRACT(MONTH FROM (o.orddate)));

This should generate the following output:
REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan                            17880
Pakistan             January         3760
Pakistan             February        3700
Pakistan             March          10420
Saudi Arabia                         6310
Saudi Arabia         January         1480
Saudi Arabia         February        1850
Saudi Arabia         March           2980

As you can see, it only includes the totals for each region, and not for each month.Interestingly, you would get the same results if you replaced CUBE with ROLLUP, issuing the following query:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY r.reg_name, ROLLUP(EXTRACT(MONTH FROM (o.orddate)));

Thus, a one column CUBE operation produces the same results as the identical ROLLUP operation.

Generating reports with only summary rows

In some cases, you may not need to include the rows that represent the subtotals generated by GROUP BY, but include only the total rows. This is where the GROUPING SETS extension of the GROUP BY clause may come in handy:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY GROUPING SETS(r.reg_name, EXTRACT(MONTH FROM (o.orddate)));

This query should produce the following report:
REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan                            17880
Saudi Arabia                         6310
                     January         5240
                     February        5550
                     March          13400
                    
You receive an interesting result when performing a partial GROUPING SET operation like the one shown next:

SELECT r.reg_name region, TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month') month, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY r.reg_name, GROUPING SETS(EXTRACT(MONTH FROM (o.orddate)));                    

This should give you the following results:

REGION               MONTH          SALES
-------------------- --------- ----------
Pakistan             March          10420
Pakistan             January         3760
Saudi Arabia         March           2980
Saudi Arabia         January         1480
Saudi Arabia         February        1850
Pakistan             February        3700

It's interesting to note that the above report doesn't include summary rows—neither for each month nor for each region. Instead, it shows you all those subtotal rows generated by GROUP BY and excluded from the full GROUPING SET operation discussed earlier in this section. In this sense, a partial GROUPING SET operation is a reverse operation to the corresponding full one.

Ranking
You might, for example, compose a query that will compute a rank for each region, based on the sales figures. To better understand how ranking works, though, let's first look at the query that simply summarizes sales figures per region:

SELECT r.reg_name region, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY GROUPING SETS(r.reg_name);

Here is the output you should see:
REGION                    SALES
-------------------- ----------
Pakistan                  17880
Saudi Arabia               6310

Now let's look at the query that will compute a sales rank for each region in ascending order:

SELECT r.reg_name region, SUM(o.total) sales, RANK() OVER (ORDER BY SUM(o.total) ASC) rank
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY GROUPING SETS(r.reg_name);

This time, the query results should look like this:
REGION                    SALES       RANK
-------------------- ---------- ----------
Saudi Arabia               6310          1
Pakistan                  17880          2


As you can see in the above rows, not only are region sales arranged in ascending order, but the rank column has also appeared. If you don't want to see the rank column, you might remove it from the select list and add the ORDER BY clause with the rank function, rewriting the query as follows:

SELECT r.reg_name region, SUM(o.total) sales
FROM regions r, orders o
WHERE r.reg_id = o.regid
GROUP BY GROUPING SETS(r.reg_name)
ORDER BY RANK() OVER (ORDER BY SUM(o.total) ASC);

This should produce the following output:
REGION                    SALES
-------------------- ----------
Saudi Arabia               6310
Pakistan                  17880

As you can see, a little trick of moving the rank function from the select list to the ORDER BY clause lets you exclude the ranking column from the output, keeping sales figures in the sales column arranged in ascending order.

Windowing
Windowing functions make up another important group of analytic SQL functions. The idea behind windowing functions is that they enable aggregate calculations to be made within a "sliding window" that may float down as you proceed through the result set. For example, you might want to include a sliding average column in the report containing total monthly sales. This column would contain the average sales for each row, based on the sales figures in the preceding, current, and following rows. Thus, the average for the January row will be calculated based on the value of the sales field in the January row, as well as the value of the sales field in the February row. In turn, the average for the February row will be determined using the sales field's values throughout all three rows. Finally, the average for the March row will be computed based on the sales for February and March.

The following diagram gives a graphical illustration of this example: 



Let's first look at the query that simply calculates total monthly sales, which should look like this:

SELECT TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month')
month, SUM(o.total) sales
FROM orders o
GROUP BY TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month');

Here is the output:
MONTH          SALES
--------- ----------
January         5240
February        5550
March          13400

Now let's modify the query and define a "window" that will float down the above rows, calculating a value for the sliding_avg field. In the following query, note the use of the ROWS BETWEEN clause, and the keywords 1 PRECEDING and 1 FOLLOWING defining the parameters of the sliding window:

SELECT TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month')
month, SUM(o.total) sales,
AVG(SUM(o.total)) OVER (ORDER BY TO_CHAR(TO_DATE(EXTRACT(MONTH FROM
(o.orddate)),'MM'),'Month') ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
sliding_avg
FROM orders o
GROUP BY TO_CHAR(TO_DATE(EXTRACT(MONTH FROM (o.orddate)),'MM'),'Month' );

The output should look like the following:
MONTH          SALES SLIDING_AVG
--------- ---------- -----------
February        5550        5395
January         5240  8063.33333
March          13400        9320

Aside from the AVG function, you can apply the windowing technique using the other aggregate functions, such as SUM, MIN, and MAX.

No comments:

Post a Comment