Saturday, August 15, 2015

Recursive SQL

I have been inactive on this forum for a long time now, tackled quite a bit of challenging projects and did not get to shell out time for blog updates. It is time to make up for that now. 
As a thank you note to all my followers, I am planning to explain and explore "Recursive SQL" which most candidates I have interviewed so far don't have a clear understanding of. So I presume this topic is an ideal candidate to discuss here.

What is Recursive query and when do you typically need to use it?
Recursion is a process of querying iteratively, and to perform a function multiple times on the data of interest. It comes in handy when you need to access hierarchies of data. For example, when you need to access multi-categorical data (with multiple levels), such as Home Furnishings -> Dining -> Tables. OR looking to navigate organizational structure : CEO -> Senior VP -> VP -> Senior Director etc. You get the idea?

Knowing when to make use of Recursion is nearly as important as understanding the processes of recursion itself. So, think about the result you need and list down the options you have to achieve that result. There could be other straight forward approaches to solving a problem than recursion, so choose carefully. Note that for illustration purposes, I have used a simple dataset, whose result may be obtained in different ways. I have used recursion merely to show a working example.

Components of a recursive SQL:
1. Seed 
2. Recursion
3. Termination and final output

Seed:
This is the portion of the SQL that bring the subset of data, on which recursion is to be performed.

Recursion:
Operation to be performed multiple times on the seed

Termination:
Final output after the recursion completes

Keywords:
WITH RECURSIVE (parameters that need to be passed to recursion) - to invoke recursion
UNION ALL - to combine the intermediate results together

Syntax:


WITH RECURSIVE <REC_NAME>(<COLUMNS TO BE PASSED TO RECURSION SEPARATED BY COMMA>)
AS (
--SEED
(<SEED SQL>
UNION ALL
--RECURSION
<RECURSIVE SQL CREATING INTERMEDIATE RESULTS>
)
--FINAL OUTPUT
SEL <OUTPUT COLUMNNAMES>

FROM <REC_NAME>;

Example:
Illustrating an example of going through a hierarchy from top to bottom. Home & Garden has 2 subcategories: Dining and Outdoor, who in turn have their own subcategory: table, chair, hammock. We want to navigate through each of this category from top to the lowest available level.

   Input:


   Output:


Working SQL emulating above:
--SAMPLE DATA TO TEST RECURSION
CREATE VOLATILE TABLE CATEG
(ITEM_ID INTEGER,
CATEGORY VARCHAR(100),
SUBCATEGORY VARCHAR(100)
)PRIMARY INDEX(ITEM_ID)
ON COMMIT PRESERVE ROWS;

INSERT INTO CATEG
SEL 100,'HOME & GARDEN','DINING';

INSERT INTO CATEG
SEL 101,'DINING','TABLE';

INSERT INTO CATEG
SEL 100,'HOME & GARDEN','OUTDOOR';

INSERT INTO CATEG
SEL 102,'OUTDOOR','HAMMOCK';

INSERT INTO CATEG
SEL 103,'DINING','CHAIR';

--EXECUTE TO SEE THE RESULT OF RECURSIVE SQL
WITH RECURSIVE REC_NM(CATEGORY,SUBCATEGORY, LVL) AS
--SEED
(SELECT CATEGORY,SUBCATEGORY, 1
    FROM
    CATEG
UNION ALL
--RECURSION
SELECT A.CATEGORY,A.CATEGORY||':'||A.SUBCATEGORY||':'||B.SUBCATEGORY, LVL+1
FROM
REC_NM A, CATEG B
WHERE A.SUBCATEGORY = B.CATEGORY
)
--FINAL OUTPUT
SELECT CATEGORY,SUBCATEGORY
FROM REC_NM

QUALIFY RANK() OVER(ORDER BY LVL DESC) = 1;

Thursday, January 31, 2013

String manipulations in Teradata

Recently, I got this mildly challenging problem, (challenges are always welcome :))
I had to manipulate a varchar field to pick a part of the string, luckily, it follows a pattern

Sample data:

qstn: explain in detail - description
qstn2: what is the nature of the issue - type
qstn3: wildcard

I had to pick only a part of the string, but I had the following restrictions:

1. Cannot pick a fixed length or from a fixed position
2. Had to look for anything between : and -
3. - may or may not be present
4. If it is not present, use till last character

This is what I wrote:



SELECT
CASE WHEN INDEX(string_manip_field,'-') =
THEN (LENGTH(string_manip_field)-INDEX(string_manip_field,':')) 
ELSE (INDEX(string_manip_field,'-') - INDEX(string_manip_field,':') + 1) 
END AS NUM_CHAR
,SUBSTR(string_manip_field,INDEX(string_manip_field,':')+1,NUM_CHAR-2)
 FROM {Tablename}

Executable format:

SELECT
CASE WHEN INDEX('qstn: explain in detail - description','-') = 0 
THEN (LENGTH('qstn: explain in detail - description')-INDEX('qstn: explain in detail - description',':')) --number of characters will be from : till end
ELSE (INDEX('qstn: explain in detail - description','-') - INDEX('qstn: explain in detail - description',':') + 1) --number of characters will be from : till hyphen
END AS NUM_CHAR
,SUBSTR('qstn: explain in detail - description',INDEX('qstn: explain in detail - description',':')+1,NUM_CHAR-2)

Any other suggestions welcome. This blog is open to discussion.

Thursday, October 27, 2011

highest 10 ids

My colleague asked me today, how can I get the 10 highest ids in a table?
Sounds simple, right, but for many usage of ordered analytic function here is new... we always think about MAX, TOP etc... so here is a quick reference:

SELECT ID FROM TABLE

QUALIFY RANK() OVER (ORDER BY ID DESC) <=10;





Update:
My assumption here is that ID field I use here is unique. If it is not unique, rank() cannot be used, row_number() needs to be used instead.
(thanks for pointing this out, carlosal!!)

Also, TOP can be used along with ORDER BY clause and achieve the result, but in my opinion & experience, ordered analytic function is much faster for this task than TOP


SELECT TOP 10 ID FROM TABLE
ORDER BY ID DESC;

Thursday, February 17, 2011

Issue in INTERVAL function

I kept getting an annoying error, while I was trying to use interval function on timestamp field. This ended up to be an bug in teradata's feature. Totally surprised they have not handled this issue. Hopefully they will fix it after my DBA raises a ticket.
Here is the issue:
INTERVAL function does not adjust dates(for leap year or for february) as ADD_MONTHS does. For example, when you execute the following:
SELECT ADD_MONTHS('2011-01-31',1)
You will get an output as 2011-02-28
But, when you use interval function instead:
SELECT CAST('2011-01-31' AS DATE)+ INTERVAL '1' MONTH
You will get an error (2665: Invalid date).

Weird! isn't it? hopefully they fix it sooner and my post will be obsolete.

Sunday, October 31, 2010

Forcing Case Sensitivity in Teradata

Teradata is by default case insensitive. But there are certain scenarios that you might face, where you want the WHERE clause to check only for the pattern in the specified case. This is how you achieve it: by adding a predicate to the constraint (CASESPECIFIC)

Eg:

SELECT CURRENT_DATE
WHERE
UPPER('hello') = 'hello' (CASESPECIFIC); --will not return the current_date

SELECT CURRENT_DATE
WHERE
LOWER
('HELLO') = 'hello' (CASESPECIFIC); --will return the current_date

(Please leave a comment if you find my posts helpful)

Friday, October 8, 2010

Top vs Sample

There was a question raised in my team's discussion forum as to which one of these is better to use to get a sample of the data? There were different kinds of suggestions from various online forums, but none had the reason behind why one is better than the other.
I read through TD documentation and this is what I put forth to my group:

TOP:
1. The system evaluates the top n operator after all other clauses in the SELECT request have been evaluated. (So, I think it doesn’t help in data processing time. Must only affect the I/O time)
2. It gives the ranked result only when you specify order by clause. So, If you don’t specify order by, results can be any n rows.

SAMPLE:
1. SAMPLE clause comes with a keyword ‘RANDOMIZED ALLOCATION’ that specifies whether retrieved rows are sampled randomly across the amps or proportionate to the number of qualified rows per AMP. Proportional allocation is default.
2. SAMPLE also operates on the evaluated output. Hence, it does not improve query processing time.

My inference:
Comparing TOP and SAMPLE, as both execute the query completely, it is better to use TOP(without order by) because we will not provide the system an overhead of getting proportional output from each amp.

Just couple more interesting finds about TOP vs RANK:
1. The QUALIFY clause with the RANK or ROW_NUMBER ordered analytical functions returns the same results as the TOP n operator.
2. For best performance, use the TOP option instead of the QUALIFY clause with RANK or ROW_NUMBER. In best-case scenarios, the TOP option provides better performance; in worst-case scenarios, the TOP option provides equivalent performance

Hope this helps!

Wednesday, August 4, 2010

Timestamp to time

I recently encountered an issue, when I tried to typecast timestamp into time in SQL Assistant(it ran fine in bteq scripts),
Eg:
SELECT CAST((CURRENT_TIMESTAMP(0) + INTERVAL '7:00' HOUR TO MINUTE) AS TIME WITH TIME ZONE );
Is what I executed, and I got the following error:
SELECT Failed. 5402: WITH TIME ZONE is incompatible with specified data type.
 
My version details:
Teradata 12.00.00.0235
ODBC 13.00.00.04
 
Resolution:
Control Panel -> Administrative tools -> ODBC -> click configure on DSN -> Options -> Datetime format -> Set to AAA
(Mine was set to III)
 
I decided to write a blog entry, as I searched quite extensively on this issue, but was not able to find a resolution on internet. My DBA team helped me by filing TD service request. Hope this helps.
 
(P.S: Sorry for the discontinuity in performance enhancement suggestions, I will continue the series when I find some spare time)