Decoding Excel Scientific Notation in SQL and ADF Mapping Dataflow Expression Language

This is another one of those posts where I just want to write something down for my own benefit, but I’m putting it on the Internet in case someone goes a-Googling for the same thing and it can help them out.

One of the challenges I’ve had recently is importing data in an ETL process that’s been processed in Excel. While you can train the Excel users as much as you like, at some time or another, some numerical values are going to show up in Excel’s Scientific notation.

If the file is saved to CSV like this, that’s exactly what gets saved, as it’s displayed in the cells, not what Excel decodes it as.

Of course there are a few ways you can avoid this happening, which users should be trained on, but you can’t always guarantee they will.

If you’ve already got a lot of this in your database you can decode it in SQL like this

	  SELECT 
       scientificcolumn,
      Replace( Substring( scientificcolumn, 0, CHARINDEX('E+',scientificcolumn)  ),
       '.', '')
       + Replicate( '0', Cast( RIGHT( scientificcolumn, len(scientificcolumn) - CHARINDEX('E+',scientificcolumn) ) AS INT) - Len(
       Replace(
       Substring( scientificcolumn, 0, CHARINDEX('E+',scientificcolumn) - 1), '.',
       '') )
       ) AS Decoded
       
FROM   mytable
Where scientificcolumn like '[0-9]%.%[0-9]%E+%[1-9]'

Prevention is better than cure of course, so you should incorporate the conversion into your ETL process.

This is the Azure Data Factory method.

Create a Derived Column with the following formula:

iif(regexMatch(scientificcolumn, `\b\d\.\d+E\+\d{1,3}\b`),
rpad(
    replace(
        substring(
            scientificcolumn,
            0,
            minus(
                instr(
                    scientificcolumn,
                    'E+'),
            1)
        ),
    '.',
    ''),
add(
    toInteger(
        split(
            scientificcolumn,
            'E+')
        [2]),
    1),
'0'
),
scientificcolumn)

This will convert the column from scientific notation if it matches that regex expression and if not, then it’s passed through unchanged.

I’ve found this pretty reliable with no false positives so far, but I’ll be the first to admit my regex foo is pretty weak. If there’s a known and well established regex string for decoding Excel scientific notation I’d love to know.

Comments welcome.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s