Skip to main content

Parsing and Selecting Columns

Introduction

This post builds on top of the previous post and go over how to parse values in the dataset, select down to just the columns we are interested in.

This post starts from the workflow created in the last post, still using the Kaggle Superstore Dataset. If you have not read that post, I recommend you do so before continuing. That workflow can be downloaded from here.

Initial Workflow

The goal of this workflow is to get all the data into a form we can then start to analyze.

Parsing the Data

Table Info

Let's start by taking a look at the structure of the data. We can do this using the info function on the first node. To add this node, select the first node and press Return or drag out from the bottom of the first node. Then, in the new node, type info or choose it from the component browser. This function returns a new table containing the metadata about the table. In this case:

ColumnItems CountValue Type
Row ID9994Integer (64 bits)
Order ID9994Char (variable length, max_size=unlimited)
Order Date9994Char (variable length, max_size=unlimited)
Ship Date9994Char (variable length, max_size=unlimited)
Ship Mode9994Char (variable length, max_size=unlimited)
Customer ID9994Char (variable length, max_size=unlimited)
Customer Name9994Char (variable length, max_size=unlimited)
Segment9994Char (variable length, max_size=unlimited)
Country9994Char (variable length, max_size=unlimited)
City9994Char (variable length, max_size=unlimited)
State9994Char (variable length, max_size=unlimited)
Postal Code9994Char (variable length, max_size=unlimited)
Region9994Char (variable length, max_size=unlimited)
Product ID9994Char (variable length, max_size=unlimited)
Category9994Char (variable length, max_size=unlimited)
Sub-Category9994Char (variable length, max_size=unlimited)
Product Name9994Char (variable length, max_size=unlimited)
Sales9994Float (64 bits)
Quantity9994Integer (64 bits)
Discount9994Float (64 bits)
Profit9994Float (64 bits)

Enso will automatically parse columns of all numbers (such as Sales, Quantity, and Profit) and recognise standard format dates and times (such as 2023-11-09 or 2023-11-09 12:22:04).

At first glance, the Postal Code column appears to be a number (US 5-digit Zip codes), but as some have leading 0s, Enso will not parse this column by default. While we probably would want to keep these as text values, we can convert this to a number (specifically, an integer) using the table's parse function.

Parse Node

Once we have added the node, the next step is to choose the columns we are interested in and to specify the target type. Click on the right of the column placeholder to add a new name, then pick Postal Code from the dropdown. Finally, select Integer from the type dropdown.

Parse Postal Code

In this dataset, the date fields are stored in US format without leading zeros. Enso uses similar format strings to .Net to specify how to parse or format dates. The table below is not an exhaustive list but gives some key date specifiers.

SpecifierDescriptionExample
dDay of month (1 to 31)1
ddDay of month (2 digit, 01 to 31)01
dddDay of week (short name)Mon
ddddDay of week (full name)Monday
MMonth of year (1 to 12)1
MMMonth of year (2 digit, 01 to 12)01
MMMMonth of year (short name)Jan
MMMMMonth of year (full name)January
yyYear (2 digit)23
yyyyYear (4 digit)2023

The US format of M/d/yyyy is used to parse the Order Date and Ship Date columns. Add a new parse node and two entries to the columns placeholder. Then, choose Order Date and Ship Date from the dropdowns. Next, select Date from the type dropdown.

Parse Warning

At this point, the node will be surrounded by a yellow border, indicating that a problem has occurred when trying to parse these columns. By default, Enso attaches a warning to the resulting table and puts a Nothing value in for the cells it cannot parse. In this case, the format of the dates has not yet been given. There are two ways to set the format: add a new node (using the + button) and enter the value "M/d/yyyy" or edit the existing node and type the value at the end of the code. To edit the node, Ctrl-click on it, and it will switch to a text box showing the code. Type the format string at the end of the code preceded by a space ( "M/d/yyyy"). The yellow border will disappear, and the warning will be removed.

Parse Date

The new IDE (coming soon) will have significantly improved support for editing text placeholders.

The last two parameters in the parse function allow more control over the error handling. By default, if a column you have chosen does not exist, the function will error and stop. If you wish to allow this to continue, select False in the error_on_missing_columns dropdown, and a warning will be attached to the result instead of the process erroring.

The final parameter, on_problems, specifies what Enso should do with warnings. This argument is present in a lot of our functions. Generally, warnings are attached to the result and can be examined or handled within the workflow. You can choose one of the following from the dropdown to control this.

  • Ignore: warnings are ignored, and the result is returned.
  • Report Warning: warnings are attached to the result and then returned
  • Report Error: warnings are treated as a dataflow error, and the error is returned instead of the result.

In a later post, I will fully cover how to handle warnings and errors.

Selecting Columns

Order Date, Category, and Sales are the only columns needed for this task. To choose these columns, add a new node connected to the second parse node and choose select_columns from the component browser.

Select Columns

Add three new entries to the columns argument and choose the required columns from the dropdowns under each. The select_columns function will return a new table containing only the columns you have selected.

By default, the columns will be in the same order as the source table. Choose True from the reorder dropdown to return the columns in the order of the select list. The names are matched case-sensitively. To allow case differences, choose Insensitive from the case_sensitivity dropdown.