My technical project management work currently involves transforming clinical data into a normalized format and loading it into a database for machine learning researchers to use. The project has me wishing I knew more about SQL – I run a couple of reporting queries that developers wrote for me through mySQL workbench on a weekly basis, but wish I could formulate queries of my own.
Recently, a request came in to broaden a existing muti-database query to span more databases. Before I passed it on to our database developer, I decided to take the opportunity to study the current query, learn more about SQL, and see if I could then size the work involved in broadening the query. (or even do it myself).
I ultimately concluded that broadening the query would probably involve creating new tables in the databases in question – so it wouldn’t be quite the quick query rewrite that I hoped it would be. My suspicion was strengthened when I took the request to the dev manager and he assigned it lower priority on backlog.
I’m attaching here my notes on the SQL query (with dB names changed) so that I can remember my learning process starting from zero; all comments are mine:
#why is there no 'use db' statement? I guess it's because we are dealing w multiple dBs we mention below?
#so let's grab a bunch of columns from a bunch of different tables that we INNER JOIN on later
#we start by grabbing 'provider name' column, then provider name, etc
SELECT DISTINCT providerTable.provider_name AS 'Provider'
#and project names
,projectTable.project_name AS 'Project'
#ok some conditional logic to get the 'batch' column.
#from the 'batch report', there are a lot of 'provider_id's w value 1 (and some w/ 2, 3, 4, 5 etc
#oooook, contextual knowledge: this is just to deal w/ some batch naming historical idiosyncricities. for provider 1, we only use batch_id; for all others, we use batch id + sub batch id
,CASE WHEN reportTable.provider_id =1 THEN reportTable.batch_id
WHEN reportTable.provider_id<>1 THEN CONCAT(reportTable.batch_id, reportTable.sub_batch_id)
END AS 'Batch ID'
#batch_details is in: complete.batch, complete.missing_images, complete.missing_reports, complete.study_frequency
#so batch_delails must becoming from complete.batch b/c other tables aren't mentionend in this query
#but WHY isn't batch_details prefixed? is it because its UNIQUE in this query whereas other fields might be shared across other tables mentioned here?
#for example, 'group_flag" shows up in complete and controller, which are both mentioned.
#experiment, I did prefix w/ batchTable.batch_detail...and query still executes so...yeah, think I'm right.
#and I just saw something about 'ambiguous column name error' so yeah.
,batch_detail AS 'Batch Description'
#same comment -- guess it comes from batchTable.sub_condition
,sub_condition AS 'Sub Condition'
,splitTable.group_flag AS 'Group Flag'
,splitTable.split_id AS 'Split'
,splitTable.team_id AS'Team ID'
#looks like we are formatting date_delivered a little more nicely
,date_format( splitTable.date_delivered ,'%m/%d/%Y') AS 'Date Delivered'
,splitTable.number_of_patients AS 'Number Of Patients'
#OK, controller isn't some SQL notion (as I originally thought), it's just a DB name like complete or report, and split_details is a table in there.
#I guess we start w/ controller as our 'from' table b/c it's the main table we want, we just want to gussy it up w/ names columns instead of numeric ID columns
#and note that I turned some short var names (like st) into long ones (like splitTable) to make this easier to read for myself
#So we combine w/ the reportTable FIRST in order to take care of that weird batch # stuff? Don't really understand this bit, come back to it?
INNER JOIN status.batch_report reportTable
#batch id, subbatchid, provider id, and project id all have to match
splitTable.batch_id = reportTable.batch_id and splitTable.sub_batch_id = reportTable.sub_batch_id and splitTable.provider_id = reportTable.provider_id and splitTable.project_id = reportTable.project_id
#now we have another inner join.
#here's a nice illustration of how this works: https://academy.vertabelo.com/blog/illustrated-guide-multiple-join/
#here we pick up the provider name column
INNER JOIN controller.provider providerTable
ON reportTable.provider_id = providerTable.provider_id
#here we pick up the project name column
INNER JOIN controller.project projectTable
ON reportTable.project_id = projectTable.project_id
#here we pick up batch_detail and sub_condition
#not sure why we have to match on so many conditions vs providerTable and projectTable
#I guess provider and project tables are just super simple (id+name) so we don't need to match on much
INNER JOIN complete.batch batchTable
ON reportTable.batch_id = batchTable.batch_id and reportTable.sub_batch_id = batchTable.sub_batch_id and reportTable.provider_id = batchTable.provider_id and reportTable.project_id = batchTable.project_id;