Parsing and Selecting Columns
Introduction
This post builds on top of the previous post and go over how to parse values in the dataset, select down to just the columns we are interested in.
This post starts from the workflow created in the last post, still using the Kaggle Superstore Dataset. If you have not read that post, I recommend you do so before continuing. That workflow can be downloaded from here.
The goal of this workflow is to get all the data into a form we can then start to analyze.
Parsing the Data
Let's start by taking a look at the structure of the data. We can do this adding the column_info
component on the first component. To add this component, select the first component and click on the + at the bottom left of the component, or drag out from the bottom of the first component. Then, in the new component, type column_info
or choose it from the component browser. This function returns a new table containing the metadata about the table. In this case:
Column | Items Count | Value Type |
---|---|---|
Row ID | 9994 | Integer (64 bits) |
Order ID | 9994 | Char (variable length, max_size=unlimited) |
Order Date | 9994 | Char (variable length, max_size=unlimited) |
Ship Date | 9994 | Char (variable length, max_size=unlimited) |
Ship Mode | 9994 | Char (variable length, max_size=unlimited) |
Customer ID | 9994 | Char (variable length, max_size=unlimited) |
Customer Name | 9994 | Char (variable length, max_size=unlimited) |
Segment | 9994 | Char (variable length, max_size=unlimited) |
Country | 9994 | Char (variable length, max_size=unlimited) |
City | 9994 | Char (variable length, max_size=unlimited) |
State | 9994 | Char (variable length, max_size=unlimited) |
Postal Code | 9994 | Char (variable length, max_size=unlimited) |
Region | 9994 | Char (variable length, max_size=unlimited) |
Product ID | 9994 | Char (variable length, max_size=unlimited) |
Category | 9994 | Char (variable length, max_size=unlimited) |
Sub-Category | 9994 | Char (variable length, max_size=unlimited) |
Product Name | 9994 | Char (variable length, max_size=unlimited) |
Sales | 9994 | Float (64 bits) |
Quantity | 9994 | Integer (64 bits) |
Discount | 9994 | Float (64 bits) |
Profit | 9994 | Float (64 bits) |
Enso will automatically parse columns of all numbers (such as Sales
, Quantity
, and Profit
) and recognise standard format dates and times (such as 2023-11-09
or 2023-11-09 12:22:04
).
At first glance, the Postal Code
column appears to be a number (US 5-digit Zip codes), but as some have leading 0s, Enso will not parse this column by default. While we probably would want to keep these as text values, we can convert this to a number (specifically, an integer) using the table's parse
component.
Once we have added the component, the next step is to choose the columns we are interested in and to specify the target type. Click on the right of the column
placeholder to add a new name, then pick Postal Code
from the dropdown. Finally, select Integer
from the type
dropdown.
In this dataset, the date fields are stored in US format without leading zeros. Enso uses format strings to specify how to parse or format dates. The table below is not an exhaustive list but gives some key date specifiers.
Specifier | Description | Example |
---|---|---|
d | Day of month (1 to 31) | 1 |
dd | Day of month (2 digit, 01 to 31) | 01 |
ddd | Day of week (short name) | Mon |
dddd | Day of week (full name) | Monday |
M | Month of year (1 to 12) | 1 |
MM | Month of year (2 digit, 01 to 12) | 01 |
MMM | Month of year (short name) | Jan |
MMMM | Month of year (full name) | January |
yy | Year (2 digit) | 23 |
yyyy | Year (4 digit) | 2023 |
The US format of M/d/yyyy
is used to parse the Order Date and Ship Date columns. Add a new parse
component and choose Order Date
and Ship Date
from the columns dropdown list. Next, select Date
from the type dropdown.
At this point, the component will show a yellow triangle to its left, indicating that a problem has occurred when trying to parse these columns. By default, Enso attaches a warning to the resulting table and puts a Nothing
value in for the cells it cannot parse. In this case, the format of the dates has not yet been given. Click on the word "format" on the component, and a list of formats will appear. Choose M/d/yyyy from the list. The yellow triangle will disappear, and the warning will be removed.
The last two parameters in the parse
function allow more control over the error handling. By default, if a column you have chosen does not exist, the function will error and stop. If you wish to allow this to continue, select False
in the error_on_missing_columns
dropdown, and a warning will be attached to the result instead of the process erroring.
The final parameter, on_problems
, specifies what Enso should do with warnings. This argument is present in a lot of our functions. Generally, warnings are attached to the result and can be examined or handled within the workflow. You can choose one of the following from the dropdown to control this.
Ignore
: warnings are ignored, and the result is returned.Report Warning
: warnings are attached to the result and then returnedReport Error
: warnings are treated as a dataflow error, and the error is returned instead of the result.
In a later post, I will fully cover how to handle warnings and errors.
Selecting Columns
Order Date
, Category
, and Sales
are the only columns needed for this task. To choose these columns, add a new node connected to the second parse node and choose select_columns
from the component browser.
Choose the required columns from the column dropdown list. The select_columns
component will return a new table containing only the columns you have selected.
By default, the columns will be in the same order as the source table. Choose True
from the reorder
dropdown to return the columns in the order of the select list. The names are matched case-sensitively. To allow case differences, choose Insensitive
from the case_sensitivity
dropdown.