Best Practices: Querying

The following suggestions would ensure you keep to minimum the time spent on data retrieval.

  • Query only for data you need - your SELECT clause should list only the required fields (do NOT 'select *') 
    • Note that SQream is a columnar database and by being so it will load into the GPU all the files holding columns in the statement. By selecting only relevant columns you save I/O on file retrieval from file system and upload/download to/from GPU
  • Prefer using 'select count(*)' over 'select count(column_name)'
    • 'select count(*)' reads information from the metadata hence avoiding unneeded calculation.  
    • In SQream the metadata is always up-to-date and there is no need for explicit statistics gathering
  • Fetch time can be expensive. When you predict to return many rows that you don't necessary need, add 'TOP N' to the query.
    • Keep in mind that the TOP N will always impact fetch time, after the full execution is over.
  • When processing JOIN, avoid joining large tables first and only then join the small tables, as it may cause  a lot of unnecessary processing. The first join should be the one that reduces the results set the most.
  • SQream optimizer can benefit for as many 'filters' as possible. Since the optimizer can inherit predicates when the logic of the query allows it, you can put the predicate on one side of the join
  • When group by is used combined with joins, consider reducing the tables result with the GROUP BY before performing the JOIN
  • On large queries (multiple joins etc), consider using saved queries to save compilation time. Note that saved queries can use variables.
    • See SQream SQL Reference Guide for saved queries examples, usage and limitations
  • Avoid DISTINCT on large data sets where possible
  • Cast large results (such as SUM) to BIGINT to avoid overflows
  • 135
  • 20-Aug-2017