In SQL, aggregate functions will return the group by values or the aggregate function results, but it is difficult (or at least harder than it should be) to return the primary key or ROWID. In contrast, most programming languages will return the instance (or a pointer/reference to the instance) when searching for items.

Background

A project manager allocated me several weeks ago on emergency basis to help out with another project that was having difficulties with a SQL Historian system. I ended up developing a nice set of SQL tables, functions, and stored procedures to transfer data from a remote linked server into the Historian. It was actually rather fun to work on as it had several challenging aspects. Of course, after several late nights of development, I don't know how much fun like that I could take!

The key to the data transfer was obtaining the latest data from the remote database and inserting it into the Historian. The remote database was an Oracle database and the Historian was a SQL Server database. My first reaction was to do a simple query with a TOP specification but Oracle doesn't have a TOP function so I started to search for ways to achieve the same result. I came across the ROW_NUMBER() function and it did the trick although it required a sub-query such as:

SELECT  *
FROM    (
    SELECT
        SAMPLE_DAY
    ,   ASSET
    ,   TANK_LEVEL_MM
    ,   TANK_PRES_PSIG
    ,   TANK_VOL_M3
    ,   FEED_FLOW_M3HR
    ,   ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) RN
    FROM    REMOTE_DATA_VIEW_1
    ORDER BY ASSET, SAMPLE_DAY DESC
)
WHERE   RN = 1

It turns out that this query is more powerful than the TOP query as it essentially performs a GROUP BY aggregate (ASSET and order by SAMPLE_DAY descending) without some of the limitations of SQL aggregate functions.

Exploring Aggregate Queries

In a traditional SQL aggregate query, the columns returned by the query have to be in the GROUP BY clause or in an aggregate function. This means that it is difficult to return the primary key for a table because you usually are grouping by a name, location, or other non-unique field. Here is an example:

-- DDL: This table stores the tag names, source, and values. This
-- table is just for demonstrative purposes and does not represent
-- a normalized structure.
CREATE TABLE TAG_VALUE
(
    ValueID             int             IDENTITY    PRIMARY KEY
,   DateTime            datetime        NOT NULL
,   TagName             varchar(50)     NOT NULL
,   Value               numeric             NULL
,   Ignored             bit             NOT NULL    DEFAULT 0
,   Processed           bit             NOT NULL    DEFAULT 0
)
GO

-- Get the most recent entry grouped by tag name
SELECT  TagName
,       MAX(DateTime) AS DateTime
FROM    TAG_VALUE
GROUP BY TagName

If you try to put the primary key as a return column, you will get an error. Unfortunately this means you end up having an ugly subquery and potentially expensive (hopefully you have indices on the matching columns as well), such as:

SELECT  ValueID
FROM    TAG_VALUE   AS V1
    JOIN (

        SELECT  TagName
        ,       MAX(DateTime) AS DateTime
        FROM    TAG_VALUE
        GROUP BY TagName

    )               AS V2
    ON (    (V1.TagName  = V2.TagName)
        AND (V1.DateTime = V2.DateTime))

Using ROW_NUMBER

The ROW_NUMBER function has the powerful feature of specifying the PARTITION BY which provides some of the same functionality that GROUP BY would perform but with less hassle. For example, the following query analyzes the TAG_VALUE table and looks for duplicate TagName rows. It then marks the "old" values as "Ignored" and leaves the newest entry alone.

UPDATE  TAG_VALUE
SET     Ignored = 1
FROM    TAG_VALUE AS V1
    JOIN (
        SELECT  ValueID
        ,       ROW_NUMBER() OVER (PARTITION BY TagName ORDER BY DateTime DESC) AS RN
        FROM    TAG_VALUE
        WHERE   TagName IN (
            SELECT DISTINCT TagName
            FROM        TAG_VALUE
            GROUP BY    TagName
            HAVING      COUNT(TagName) > 1
            )
    ) AS V2
    ON (V1.ValueID = V2.ValueID)
WHERE   RN > 1

Conclusion

So, in conclusion, ROW_NUMBER provides an easier method of returning the actual row so you can perform updates, deletes, or return the primary key.


Sample SQL Code

/* ====
NOTE: The table structure resembles a database design that I have seen used.
I did not design the table structure and it is intentionally flat. Originally
this example used a remote query to allow SQL Server to query ORACLE, but for
the sample it is all in SQL Server. With very minor changes it works in Oracle
as well.
==== */

-- DDL: This table simulates a remote linked server
CREATE TABLE REMOTE_DATA_VIEW_1
(
    SAMPLE_DAY          datetime
,   ASSET               varchar(50)
,   TANK_LEVEL_MM       numeric
,   TANK_PRES_PSIG      numeric
,   TANK_VOL_M3         numeric
,   FEED_FLOW_M3HR      numeric
)
GO

-- Sample data
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-04', 'ASSET_01', 1200.0, 18.5, 100.0, 150.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-01', 'ASSET_01', 1100.0, 17.5,  90.0, 145.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-01-25', 'ASSET_01', 1000.0, 16.5,  80.0, 155.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-03', 'ASSET_02',  800.0, 19.5, 110.0, 160.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-02', 'ASSET_02',  750.0, 17.0,  95.0, 165.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-02-04', 'ASSET_03', 1500.0, 18.0, 100.0, 170.0)
INSERT REMOTE_DATA_VIEW_1 VALUES ('2008-01-31', 'ASSET_04', 1350.0, 19.0,  90.0, 135.0)
GO


-- Return the raw data
SELECT
    SAMPLE_DAY
,   ASSET
,   TANK_LEVEL_MM
,   TANK_PRES_PSIG
,   TANK_VOL_M3
,   FEED_FLOW_M3HR
,   ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) RN
FROM    REMOTE_DATA_VIEW_1
ORDER BY ASSET, SAMPLE_DAY DESC

/* ====
SAMPLE_DAY              ASSET    TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR RN
----------------------- -------- ------------- -------------- ----------- -------------- --
2008-02-04 00:00:00.000 ASSET_01 1200          19             100         150            1
2008-02-01 00:00:00.000 ASSET_01 1100          18             90          145            2
2008-01-25 00:00:00.000 ASSET_01 1000          17             80          155            3
2008-02-03 00:00:00.000 ASSET_02 800           20             110         160            1
2008-02-02 00:00:00.000 ASSET_02 750           17             95          165            2
2008-02-04 00:00:00.000 ASSET_03 1500          18             100         170            1
2008-01-31 00:00:00.000 ASSET_04 1350          19             90          135            1
==== */


-- Use a simple embedded query
SELECT  *
FROM    (
    SELECT
        SAMPLE_DAY
    ,   ASSET
    ,   TANK_LEVEL_MM
    ,   TANK_PRES_PSIG
    ,   TANK_VOL_M3
    ,   FEED_FLOW_M3HR
    ,   ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) AS RN
    FROM
        REMOTE_DATA_VIEW_1
    )   DATA
WHERE   RN = 1

/* ====
SAMPLE_DAY              ASSET    TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR RN
----------------------- -------- ------------- -------------- ----------- -------------- --
2008-02-04 00:00:00.000 ASSET_01 1200          19             100         150            1
2008-02-03 00:00:00.000 ASSET_02 800           20             110         160            1
2008-02-04 00:00:00.000 ASSET_03 1500          18             100         170            1
2008-01-31 00:00:00.000 ASSET_04 1350          19             90          135            1
==== */


-- This is an uglier version of the query above but provided for comparison.
-- Return only the most recent entry using SAMPLE_DAY and ASSET as keys. On a production
-- table, there should be a real primary key or ROWID that you would use instead.
SELECT
    DATA.SAMPLE_DAY
,   DATA.ASSET
,   DATA.TANK_LEVEL_MM
,   DATA.TANK_PRES_PSIG
,   DATA.TANK_VOL_M3
,   DATA.FEED_FLOW_M3HR
FROM    (

        SELECT
            SAMPLE_DAY
        ,   ASSET
        ,   ROW_NUMBER() OVER (PARTITION BY ASSET ORDER BY SAMPLE_DAY DESC) AS RN
        FROM
            REMOTE_DATA_VIEW_1

    )   AS  REMOTE_QUERY
    JOIN    REMOTE_DATA_VIEW_1  AS DATA
        ON (    (REMOTE_QUERY.SAMPLE_DAY  = DATA.SAMPLE_DAY)
            AND (REMOTE_QUERY.ASSET       = DATA.ASSET) )

WHERE   REMOTE_QUERY.RN = 1
ORDER BY DATA.ASSET, DATA.SAMPLE_DAY DESC

/* ====
SAMPLE_DAY              ASSET    TANK_LEVEL_MM TANK_PRES_PSIG TANK_VOL_M3 FEED_FLOW_M3HR
----------------------- -------- ------------- -------------- ----------- --------------
2008-02-04 00:00:00.000 ASSET_01 1200          19             100         150
2008-02-03 00:00:00.000 ASSET_02 800           20             110         160
2008-02-04 00:00:00.000 ASSET_03 1500          18             100         170
2008-01-31 00:00:00.000 ASSET_04 1350          19             90          135
==== */
GO

Technorati tags: , ,