SAP Data Intelligence Archives - ERP Q&A

Using SAP Data Quality Management, microservices for location data in SAP Data Intelligence Pipeline

I am happy to hear that many of you have signed up for SAP Data Quality Management, microservices for location data to uncover the address cleansing and geocoding capabilities on the SAP Business Technology Platform (BTP).

Today I want to show how you can leverage these microservices for location data in SAP Data Intelligence. First, you may have noticed a slight naming difference for the service. In the SAP BTP Cockpit, the microservices for location data are available as Data Quality Services. In SAP Data Intelligence, you can find the microservices for location data as DQMm operators in the Modeler application. There are a DQMm Address Cleanse operator and a DQMm Reverse Geo operator, and they are used in conjunction with the DQMm Client operator, which is a specialization of the OpenAPI Client operator. These operators are available on SAP Data Intelligence Cloud as well as SAP Data Intelligence on premise.

DQMm Operators Overview

When you log in to SAP Data Intelligence, go to the Modeler application. You can browse sample graphs that use the DQMm operators in the Graphs tab, and find the DQMm operators themselves in the Operators tab. They are currently available as Generation 1 operators.

Let’s open the configuration of the DQMm Address Cleanse operator. You can adjust the settings according to your preference.

Now open the DQMm Client configuration. You will need to set the connection properties with the OAuth authorization credentials.

  • Host: [url]:443
  • oauth2TokenUrl: [url]/oauth/token?grant_type=client_credentials
  • oauth2ClientId: [clientid]
  • oauth2ClientSecret: [clientsecret]

To find these values, go to the SAP BTP Cockpit where you subscribed to the Data Quality Services. Click the service key you created to see the credentials containing those property values.
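If you want to sanity-check the service key values outside of Data Intelligence first, a quick token request can confirm them. The sketch below (Node.js 18+) assumes the standard OAuth client-credentials flow with HTTP Basic authentication and uses the same placeholder values as above; the DQMm Client operator performs this token exchange for you at runtime, so this is for verification only.

// Minimal sketch, not part of the pipeline: request an OAuth token with the
// client-credentials grant using the service key values. Replace the
// placeholders with the values from your own service key.
const tokenUrl = "https://[url]/oauth/token?grant_type=client_credentials";
const clientId = "[clientid]";
const clientSecret = "[clientsecret]";

async function fetchToken() {
  const response = await fetch(tokenUrl, {
    method: "POST",
    headers: {
      // Client credentials are passed via HTTP Basic authentication.
      Authorization:
        "Basic " + Buffer.from(clientId + ":" + clientSecret).toString("base64"),
    },
  });
  if (!response.ok) {
    throw new Error("Token request failed with status " + response.status);
  }
  const payload = await response.json();
  return payload.access_token;
}

fetchToken().then((token) => console.log("Token received, length:", token.length));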

Now that you have configured the DQMm operators, let's run the graph. You can check the status to confirm whether the graph has completed successfully.

At the time of writing this article, the graph template still uses the deprecated Write File operator. You may want to replace it with the latest Write File operator: delete the old File Writer operator (com.sap.storage.write) and add a new File Writer operator (com.sap.file.write). When you connect the DQMm Client operator to the Write File operator, you will see the ToFile converter, which converts the message to a string. Select the first option.

Set the Path to the file path where you want the output file to be written, and change the Mode to Append to capture all the records.

Examine the DQMm output

Let’s go to the System Management to look at the generated output.

If you look at the data, you can see that the attributes and body of the message are concatenated together, forming an invalid JSON structure.

Format the DQMm output

To convert the output of the DQMm Client operator to valid JSON format, let's use the Format Converter operator. Before we can do that, we need a few operators to prepare the data. Add a JavaScript operator with two ports, input and output, both of the message type. Then add some JavaScript code that creates a message body with the data you want to output. You can use the code snippet below and adjust it as needed.

$.setPortCallback("input", onInput);

// Convert a byte array into a string.
function bin2String(array) {
  var result = "";
  for (var i = 0; i < array.length; i++) {
    result += String.fromCharCode(array[i]);
  }
  return result;
}


// Input data handler.
function onInput(ctx, s) {
    
    // Retrieve the HTTP status code to see if the
    // request was successful.
    var status = s.Attributes["openapi.status_code"];
    var id = s.Attributes["message.request.id"];
    var body = bin2String(s.Body);

    // If the request was successful then convert
    // the JSON into a format that the Format Converter
    // operator will understand and output to the
    // output port.
    if (status === "200") {
        
        // Convert the body to a JSON object.
        var json = JSON.parse(body);
        
        // Add the id if one is present.
        if (id !== null) {
            json.id = s.Attributes["message.request.id"];
        }
        
        // Wrap the JSON in an array for the Format Converter
        // operator.
        s.Body = [json];
        
        // Output the new message.
        $.output(s);
    }
    else {
        $.log("Error processing record with id " + id, $.logSeverity.ERROR, status, body);
    }
}

After the JavaScript operator, add the ToBlob converter because the Format Converter expects a blob-type input. Now you can connect the ToBlob converter to the Format Converter operator, and then to the Write File operator.

Run this graph and check the output file again in System Management. You should see the data in valid JSON format.

Use of Configuration

In the DQMm Address Cleanse operator, you have an option to specify the configuration source. If you choose service, you can specify a configuration name.

To view or create your own configuration, you can go to the Configuration UI.

In the Configuration UI, you can view some predefined configurations. There are simple address configuration samples as well as additional configurations for the Business Suite applications. You can select one that suits your application, make a copy of it to create a new configuration, and start customizing it for your application.

Example of reading from and writing to HANA Table

Using the Format Converter, you can also convert data from JSON format to CSV format, or vice versa. Here is an example of reading from a HANA table and writing back to another HANA table with the DQMm operators.

1. Read HANA Table – Read address data from a HANA table
2. ToString Converter
3. JavaScript Operator (input: string, output: string) – Convert the data to CSV format.

$.setPortCallback("input",onInput);

// Create CSV data: one comma-separated row per line.
function onInput(ctx,s) {
    var first = true;
    var output = "";

    var obj = JSON.parse(s);
    for (var i in obj)
    {
        if (!first)
        {
            output += "\n";
        }
        first = false;
        output += obj[i].toString();
    }
    $.output(output);
}

4. ToBlob Converter
5. Format Converter – Convert to JSON format
6. ToMessage Converter
7. DQMm Address Cleanse
8. DQMm Client
9. JavaScript Operator (input: message, output: message) – Same as the previous example, to format the DQMm output
10. ToBlob Converter
11. Format Converter – Convert back to CSV format
12. SAP HANA Client – Write validated address data to a HANA table

Example of consolidating output records

When you run a graph with the DQMm operators, you might notice that the DQMm Client operator sends an HTTP request to the service record by record, and each response is passed to the subsequent operator per record rather than as a collection of records. If you want to consolidate all of these records before sending them to the subsequent operator, you can add some code that waits for all requests to be processed and then consolidates the output records. Here is an example.

1. Message Generator – Same as the one in the sample graph
2. 1:2 Multiplexer
3. ToString Converter
4. JavaScript Operator 1 (input: string, output: int64)

$.setPortCallback("input",onInput)

function onInput(ctx,s) {
    var json = JSON.parse(s);
    $.output(json.length)
}

5. DQMm Address Cleanse
6. DQMm Client
7. JavaScript Operator 2 (inputtarget: int64, inputcurrent: message, output: message)

$.setPortCallback("inputtarget",onInputTarget)
$.setPortCallback("inputcurrent",onInputCurrent)

var target = -1
var current = 0

var dqmmout = [];

// Convert a byte array into a string.
function bin2String(array) {
  var result = "";
  for (var i = 0; i < array.length; i++) {
    result += String.fromCharCode(array[i]);
  }
  return result;
}

function processResults(s) {
    // Retrieve the HTTP status code to see if the
    // request was successful.
    var status = s.Attributes["openapi.status_code"];
    var id = s.Attributes["message.request.id"];
    var body = bin2String(s.Body);
    var result = {};

    if (status === "200") {
        // Convert the body to a JSON object.
        result = JSON.parse(body);
    }
    else {
        result.error = $.log("Error processing record with id " + id, $.logSeverity.ERROR, status, body);
    }
    result.id = id;
    result.status = status;

    return result;
}

function onInputTarget(ctx,s) {
    target = s
    if (current == target && dqmmout.length == target) {
          $.output(dqmmout)
    }
}

function onInputCurrent(ctx,s) {
    current++
    
    // processResults
    var result = processResults(s);
    dqmmout.push(result)
    
    if (current == target) {
        $.output(dqmmout)
    }
}

8. ToFile Converter
9. Write File

Ariba Analytics using SAP Analytics Cloud, Data Intelligence Cloud and HANA DocStore – Part 3

This is Part Three of a blog series on Ariba Analytics using SAP Analytics Cloud, Data Intelligence Cloud and HANA DocStore. If you would like to start with Part One, please click here

Recap

SAP Analytics Cloud makes it easy for businesses to understand their data through its stories, dashboards and analytical applications. Our worked example is using SAP Ariba Data to create an SAC Story that lets you know how much spend has been approved within Ariba Requisitions created in the last thirty days

A simple SAC Story tracking Approved Requisitions

In Part One of our blog series we discussed how we can retrieve data from SAP Ariba’s APIs using SAP Data Intelligence Cloud. We stored this data as JSON Documents in the SAP HANA Document Store

In Part Two of our blog series we built SQL and Calculation Views on top of our JSON Document Collection

In this blog post we’ll use the Calculation View in SAP Analytics Cloud as a Live Data Model, which will provide the data to our SAP Analytics Cloud Story

Viewing our HANA DocStore Collection data in an SAP Analytics Cloud Story

Accessing our HDI Container from SAP Analytics Cloud

Before we can consume our Calculation View in SAP Analytics Cloud, we’ll need to connect SAC to our HDI Container. We can do this from within SAC itself

Click on Connections
Click on Add Connection
Select SAP HANA under Connect to Live Data

Next, we’ll have to enter the host and credentials for our HDI Container. If you’re not sure where to find these, refer to Part One of this blog series where we retrieved these for Data Intelligence Cloud (under the heading Creating Our Connections in Data Intelligence)

Choose HANA Cloud as the Connection Type, enter HDI details then click OK

Our HANA Cloud Connection has been created and now we’re ready to create in SAC

Creating a Live Data Model

Within SAP Analytics Cloud we’re going to use a Live Data Model to access our Calculation View in real time. This means that the data in our Document Store Collection will be available immediately after our Data Intelligence Pipeline updates it

Another benefit of using this Live Data Model compared to creating a Model on Acquired Data is that data doesn’t need to be copied to SAP Analytics Cloud for use

Click on Modeler, then Live Data Model
Select SAP HANA as the System Type, our AribaHDI Connection then click on the Input Help for Data Source
Click on our Calculation View
Click OK

Now we’re looking at the Live Data Model in the SAP Analytics Cloud Modeler. We can see our Calculated Measure, ReportingAmount

Viewing the Measures for our Live Model

We can also check the Live Model’s Dimensions

Click on All Dimensions
Our Dimensions are all here
Click on Save
Enter a Name and Description then click Save

Now that we’ve got our Live Data Model, we’re ready to create our Story and visualize our Ariba Data

Creating an SAC Story

Stories within SAP Analytics Cloud let us visualize data in a number of ways, including charts, visualizations and images

Click on Stories

Within a Story there are a number of different Page Types available. For our example we’re going to add a Responsive Page. A Responsive Page allows you to create layouts that resize and adapt when viewed on different screen sizes

Click on Responsive Page
Leave Optimized Design Experience selected and click Create

First we’re going to give our Page a title – for example: Approved Requisitions (Past 30 Days)

Double Click then give your Page a title

Next, we’re going to attach the Live Data Model we created to our Story

Click on Add New Data
Click on Data from an existing dataset or model
Click on Select other model
Click on our Live Data Model

Now we’re able to use the data from our Live Data Model in our Story

Click on Chart in the Insert Toolbar
Click on our Chart, then on the Chart Type dropdown and select the Numeric Point

Now it’s time to give our chart some data

Click on Add Measure under Primary Values
Click the checkbox next to ReportingAmount

Now we can see a sum of all of our Approved Requisitions. However, we may (or rather we probably will) have Requisitions in more than one currency. To separate these Requisitions we’ll need to use Chart Filters

Click on our Numeric Point again to exit the Measure Selection
Click on Add Filters
Select the Column ReportingCurrency

From here we’ll be able to select which values of ReportingCurrency we’d like to see reflected in our ReportingAmount total. Given that it doesn’t make sense to sum totals in different currencies without first converting them, we’re going to select only a single currency

The data in your system may have different currencies than mine, so feel free to adjust accordingly

Select your currency, unselect Allow viewer to modify selections and click OK
We now have our total ReportingAmount for our first currency

While it’s great to have our total, the average end user is not going to know what ReportingAmount means. It’s time to give our Numeric Point a meaningful label

Click on Primary Value Labels under Show/Hide to turn off the lower ReportingAmount label
Double click on the Chart Title to edit
Type your Chart Title and use the Styling panel as desired
We can adjust the size of our Chart using the arrow in the corner

At this point, we have our Numeric Point set up and ready to go

Our finished Numeric Point

If your system only has the one currency, you can leave it here. If your system has more than one currency, you can duplicate the Numeric Point, then change the Chart Filter and Chart Title using the same steps we just followed

Click on Copy, then click Duplicate

Once we’ve finished with our currencies, it’s time to save

Click on Save
Enter a Name and Description then click on OK

Our Story is now finished and ready to be viewed

Sharing our SAP Analytics Cloud Story

Now we’ve created our Story, but we don’t want it to just sit inside our Files on SAP Analytics Cloud – we want people to use it. Let’s share our story with our colleagues

Click on Files
Click on the checkbox next to our Story, then click on Share under the Share menu
Click on Add Users or Teams
Select users you’d like to share the Story with then click OK
Click the checkbox to notify the users by email, then click Share
If we’d like to change the Access Type, we can do that here

Now we’ve shared our Story with users, and decided what kind of access we’d like them to have. This isn’t the only type of sharing available in SAP Analytics Cloud – for example you can learn about publishing to the Analytics Catalog here, and read about your other options in the Help Documentation

The Analytics Catalog is a single access point for SAP Analytics Cloud content vetted by your own Content Creators

Scheduling the Data Intelligence Pipeline

Now that our setup work is done and the users can view our SAP Analytics Cloud Story, we want to make sure the underlying data in our SAP HANA Document Store Collection is kept up to date on a regular basis

Since our SAP Data Intelligence Pipeline is responsible for truncating the data and supplying a new batch of data, we want to schedule it to run automatically. We can do this from Data Intelligence Cloud itself

Click on Monitoring
Click on Create Schedule
Write a description for our Schedule then choose our Pipeline under Graph Name
Choose how often our Pipeline will be run
Our Schedule has been created

When we create our first Schedule, we’ll see My Scheduler Status: Inactive. We don’t need to worry – our Scheduler’s Status is actually Active. To see it, we can click on Refresh

Click Refresh
Our Scheduler Status is Active

An Important Note on Scheduling Ariba APIs

You may remember back in our first Blog Post that our Pipeline waits twenty seconds between each call of the Ariba APIs. This is because each of Ariba’s APIs has rate limiting. These rate limits are cumulative

What that means for us is that for each Realm and API (for example MyCorp-Test Realm and Operational Reporting for Procurement – Synchronous API) we have a shared rate limit, no matter how it’s called

The Pipeline we provided in Part One of this blog series is optimized for performance – i.e. it makes calls as fast as Ariba’s rate limiting will allow

If there’s more than one instance of this Pipeline running at once, both will receive a rate limit error and no data will be uploaded to our Document Store Collections

Please keep this in mind when you plan for the scheduling of these pipelines, and refer to the Ariba API Documentation for up-to-date information on Ariba Rate Limits as well as how many Records are returned each time the API is called

Wrap-Up

Throughout this blog series we’ve shown how we can set up a pipeline that will get data from Ariba and persist it in HANA Cloud Document Store, as well as how we can schedule it to run periodically

We’ve also shown how we can create a Calculation View on top of these JSON Documents, and finally how we can create a Story in SAP Analytics Cloud that will let us visualise our Ariba Data

The data and models in these blog posts have been kept simple by design; in a productive scenario we will likely want much more complex combinations and views.

Ariba Analytics using SAP Analytics Cloud, Data Intelligence Cloud and HANA DocStore – Part 2

This is Part Two of a blog series on Ariba Analytics using SAP Analytics Cloud, Data Intelligence Cloud and HANA DocStore. If you would like to start with Part One, please click here

Recap

SAP Analytics Cloud makes it easy for businesses to understand their data through its stories, dashboards and analytical applications. Our worked example is using SAP Ariba Data to create an SAC Story that lets you know how much spend has been approved within Ariba Requisitions created in the last thirty days

A simple SAC Story tracking Approved Requisitions

In Part One of our blog series we discussed how we can retrieve data from SAP Ariba’s APIs using SAP Data Intelligence Cloud. We stored this data as JSON Documents in the SAP HANA Document Store

In this blog post, we’re going to build SQL and Calculation Views on top of our JSON Document Collection

Viewing our HANA DocStore Collection data in an SAP Analytics Cloud Story

Design-Time Artifacts in Business Application Studio

As we discussed in our last blog post, objects within HANA usually have both a design-time and runtime artifact. Design-time artifacts are useful because they fully describe the object and can be deployed consistently across multiple HDI Containers or even HANA instances

When we deploy our design-time artifacts, they will be created as runtime artifacts inside our HDI Container

Our JSON Document Collection has already been created, and is already storing our Ariba JSON Documents. From here, it’s time to model our other artifacts

Creating our SQL View

JSON Documents are useful in a variety of situations where you don’t have strict, predefined schemas. When we retrieve our data from the Ariba APIs, we may retrieve data that doesn’t map cleanly to a table schema (for example, data that is nested). Putting this data in the HANA DocStore Collection allows us to store the complete document, ensuring nothing is lost

In order for us to use this data for analytics, we’ll need to map it to some sort of schema. We can create a logical schema on top of our Collection using a SQL View. This allows us to access a predefined subset of our data for analytics while leaving the full data untouched in our Collection

We’ll create the SQL View in Business Application Studio

Click on View, then Find Command or press Ctrl+Shift+P
Use Find Command to find Create SAP HANA Database Artifact, then click on it
Select SQL View as the artifact type, and enter the artifact name then click on Create

SQL Views use the following format:

VIEW "aribaRequisitionSQLView"
AS SELECT "UniqueName", "Name", [...]
FROM "aribaRequisition"

If you’re familiar with SQL, you may recognise this as the same syntax that you would use to create a standard SQL View, just missing the word “CREATE”

The SQL View doesn’t duplicate any data, just provides a schema that we can use to access the underlying data

Our data in JSON Documents is stored as Key-Value pairs

"Status":"Complete"

To retrieve the value “Complete”, we would SELECT “Status”

JSON Documents may also have nested data

"Content":{"ItemId":"3728754507"}

To retrieve the value “3728754507”, we would SELECT “Content”.”ItemId”, with the full stop marking nested keys

Our example will use the following SQL View:

VIEW "aribaRequisitionSQLView"
AS SELECT "UniqueName", "Name", 
"TotalCost"."AmountInReportingCurrency" AS "AmountInReportingCurrency", 
"ReportingCurrency"."UniqueName" AS "ReportingCurrency", 
"ApprovedState", "Preparer"."UniqueName" AS "Preparer",
"Requester"."UniqueName" AS "Requester", "StatusString", 
"CreateDate", "SubmitDate", "ApprovedDate", "LastModified", 
"ProcurementUnit"."UniqueName" AS "ProcurementUnit"
FROM "aribaRequisition"

The fields we’re using are only a fraction of the fields available in the Documents within our Collection – if we want to customize the scenario later, there are plenty more to choose from

We want to make sure this SQL View is deployed and ready for use, so click on the Deploy rocket

We can deploy our SQL View under SAP HANA Projects on the left

Creating our Calculation View

While we’re in Business Application Studio, we’re going to create our Calculation View. This Calculation View is what we’ll be consuming in SAP Analytics Cloud

As before, we’re using View->Find Command then Create SAP HANA Database Artifact

Choose Calculation View, enter a Name then click on Create

Business Application Studio has an inbuilt editor for Calculation Views, which we’ll use to create ours

Click on Aggregation, then click the Plus symbol
Search for our SQL View, select it, then click Finish

Now that our SQL View is available as a Data Source, we want to make sure its columns end up in our Calculation View

Click on Aggregation, then click on Expand Details
Click on our SQL View on the left then drag and drop to Output Columns on the right
Our SQL view columns will now be available in our Calculation View

Because this is a Calculation View of type Cube (rather than Dimension), we’ll need to make sure it includes at least one Measure

The columns in our SQL View all have the default data type NVARCHAR(5000). If we try to mark this column as a Measure directly, it will treat it as a string – giving us the Aggregation options COUNT, MIN and MAX

We want to treat this column as the number it is – as a workaround, we’ll need to create a Calculated Column

Creating our Calculated Column

A Calculated Column is an output column that we create within the Calculation View itself. Rather than being persisted, the values are calculated at runtime based on the result of an expression

For our example, we’re using a very simple expression. First, we have to make our way to the Expression Editor

Click on Calculated Columns
Create a Calculated Column using the Plus symbol, then Calculated Column
Click on the Arrow

Next we’re going to give our Calculated Column a name and data type. Because the granularity of our example is the Requisition-level and not the item-level, the decimal points won’t meaningfully change the results. Given that, we’re going to use the Integer data type

Give the Calculated Column a Name, and choose the Data Type Integer
Choose Measure as the Column Type
Click on Expression Editor

The Expression Editor is where we’ll define how the column is calculated. Select our AmountInReportingCurrency Column

Select our Column from the left
Our Column is in the Expression

Our Calculated Column will take the value of AmountInReportingCurrency and convert it to an Integer

Now we want to validate the syntax of our Expression

Click on Validate Syntax
Our Expression is valid

We have one last thing to do inside our Calculation View – we want to filter the data to only include Approved Requisitions. If we want to use the Value Help to set our Filter, we’ll need to Deploy the Calculation View

Deploy our Calculation View
Click on Filter Expression
Click on ApprovedState under Columns
Add an Equals Sign (=) then click on the Value Help
Select Approved then click OK

Now we can check the syntax of our Filter

Click on Validate Syntax
Our Filter is valid

Before we Deploy our Calculation View, we want to make sure that we're only sending our integer Calculated Column and not the string version. To do this, we go to the Semantics Node

Click on Semantics, then Columns
Check Hidden for our AmountInReportingCurrency Column to exclude it from our Calculation View

All of the Columns we need, including our new Calculated Column are available within the Calculation View. Now we’re ready to Deploy it one last time

Once again, click on the Deploy Rocket under SAP HANA Projects

Checking our Runtime Artifacts

Now that we’ve finished deploying our Design-time artifacts, we’ll have the corresponding Runtime artifacts inside of our HDI Container. We can check these by going to SAP HANA Database Explorer from within Business Application Studio

Click on Open HDI Container on the left under SAP HANA Projects

In the Database Explorer, we want to first check on our SQL View

Click Views on the left, then click on our SQL View
Our SQL View

We can see all of the Columns in our created SQL View. If we want to check out some of the data returned by our SQL View, we can click on Open Data

Click on Open Data
Data from our SQL View is displayed

Next it’s time to check on our Calculation View

Click Column Views on the left, then click on our Calculation View
Our Calculation View
Click on Open Data

Database Explorer will open our Calculation View for Analysis. We’re going to do our analysis in SAP Analytics Cloud, so for now we just want to verify the Raw Data

Click on Raw Data
Data from our Calculation View is displayed

Wrap-Up

During this blog post we’ve built a SQL View and Calculation View on top of our HANA DocStore Collection. We’ve also made sure that our Calculation View only contains Approved Requisitions

In the third and final blog post we’ll consume our Calculation View as a Live Data Model before visualizing it in an SAP Analytics Cloud Story. We’ll also schedule the Data Intelligence Pipeline we created in our first blog post so that the data in our HANA DocStore Collection is updated on a regular basis automatically

Ariba Analytics using SAP Analytics Cloud, Data Intelligence Cloud and HANA DocStore – Part 1

Introduction

SAP Analytics Cloud makes it easy for businesses to understand their data through its stories, dashboards and analytical applications. However, sometimes we might not be sure how we can leverage SAC to create these based on data from other applications

For this worked example, we’re going to make use of SAP Data Intelligence Cloud to retrieve data from SAP Ariba through its APIs, before storing it in SAP HANA Cloud’s JSON Document Store

In further blog posts, we will build a model on top of this stored data and show how this can be consumed in an SAP Analytics Cloud Story. The focus of this series of blog posts is to show a technical approach to this need, not to provide a turnkey ready-to-run SAC story. After following this series you should have an understanding of how you can prepare your own stories using this approach

Ariba Analytics in SAP Analytics Cloud

For this example, we’re going to create a simple story that lets you know how much spend has been approved within Ariba Requisitions created in the last thirty days

A simple SAC Story tracking Approved Requisitions

A Requisition is the approvable document created when a request is made to purchase goods or services. Our approach will let us view only the Approved Requisitions, excluding those still awaiting approval

For those feeling more adventurous, this setup can be repeated with different document types, and those combined to create more in depth SAP Analytics Cloud Stories. This is outside of the scope of our blog series

Solution Overview

Our finished solution will need SAP HANA runtime artifacts such as Document Store Collections, SQL Views and Calculation Views. We will define these as design-time artifacts in Business Application Studio, then deploy them to an HDI Container within our SAP HANA Cloud instance

Deploying our Design-time artifacts into SAP HANA Cloud

Using a scheduled SAP Data Intelligence Cloud Pipeline, we’ll query SAP Ariba’s APIs and place the data within our HANA Cloud Document Store Collection

Scheduled replication of Ariba Data

Our SQL View lets us create a view on top of the data within our JSON Documents. Creating a Calculation View on top of one or many SQL views will let us expose the data to SAP Analytics Cloud

Viewing the data in SAP Analytics Cloud

SAP Analytics Cloud can use HANA Cloud Calculation Views as the source for Live Data Models. With Live Data Models, data is stored in HANA Cloud and isn’t copied to SAP Analytics Cloud

This gives us two main benefits: We avoid unnecessarily duplicating the data, and ensure changes in the source data are available immediately (provided no structural changes are made)

Finally, we use the Live Data Model to create a Story within SAP Analytics Cloud. Once we’ve got everything set up, we can use this story to check our data at any time, with the Data Intelligence Pipeline refreshing the data in the background on a predefined schedule

Creating an Ariba Application

In order to access the APIs provided by Ariba, we’ll need to have what’s known as an Ariba Application. We do this through the SAP Ariba Developer Portal

For our use case we will be requesting access to the Operational Reporting for Procurement API

From the Ariba Developer Portal, click on Create Application
Click on the Plus Symbol
Enter an Application Name and Description then click on Submit

Once the Application has been created, we’ll need to request API access for the Application

Click on Actions, then Request API Access
Select the Operational Reporting for Procurement API, then select your Realm and click on Submit

Once the API Access Request has been approved by Ariba, your admin will be able to generate the OAuth Secret for our application

Your Ariba admin can click on Actions, then Generate OAuth Secret

This will generate our OAuth Secret, which is required to use the API. The secret will only be displayed once, so the admin should (securely) store this and provide it to you for use in the application

If the OAuth Secret is lost, the admin can regenerate it, at which point the old secret will stop working and you will have to use the newly generated secret

Ariba API

When we call the Ariba API, we have a number of things to consider. For our example, we’re using the Synchronous API to retrieve data, but there’s also a set of Asynchronous APIs you should consider when retrieving bulk data

Documentation is available online

In addition, when retrieving data sets, you have to specify an Ariba View that you wish to retrieve. These are similar to reporting facts in the Ariba solution, such as Requisition or Invoice. Views will specify which fields are returned, and may also specify filters you should provide when calling them

To simplify our example we’re going to use a System View, which is predefined in Ariba. You are also able to work with Custom Views using the View Management API to better match your requirements but this falls outside the scope of this blog series

To explore these at your own pace, you can visit developer.ariba.com

Enabling Document Store in HANA Cloud

The Document Store is SAP HANA Cloud‘s solution for storing JSON Documents. While the Column and Row Stores use Tables to store their data, the Document Store stores data inside Collections

Before we activate the Document Store in HANA Cloud, just a word about resources. Like the Script Server, the Document Store is an additional feature that can be enabled; however, we should consider HANA Cloud's current resourcing before enabling it.

When we’re ready to enable, we’ll need to navigate to SAP HANA Cloud Central.

From the BTP Control centre, we select our Global Account and click Open in Cockpit
From here we see our Subaccounts – we choose the Subaccount where our HANA instance resides
From our Subaccount, we click on Spaces
From the Spaces page, we select the Space that contains our HANA instance
Click on SAP HANA Cloud
Click on Actions, then Open In SAP HANA Cloud Central

From HANA Cloud Central, we can then activate the Document Store

Click on the dots, then choose Manage Configurations
Click on Edit
Go to Advanced Settings, select Document Store then click on Save

Once our HANA Cloud instance has restarted, we’ll be able to use the Document Store

Creating a DocStore Collection in Business Application Studio

While we can create a Collection directly using SQL through Database Explorer, we want to make sure we also have a design-time artifact for our DocStore Collection

To do this, we’ll use the Business Application Studio. For those unfamiliar with Business Application Studio, you can follow this Learning Journey Lesson to set up a Workspace – we’ll assume this is already in place

It’s time to set up our SAP HANA Database Project, and create the HDI Container where our runtime objects will reside

Creating our Project
Select SAP HANA Database Project

Next we’ll need to provide some information for our project

Give our Project a name and click Next
Leave the Module name as is and click Next

Double check the Database Version and Binding settings then click Next

Setting our Database Information

Next we have to bind the project to a HANA Cloud instance within Cloud Foundry. The Endpoint should be automatically filled, but we have to provide our Email and Password before we can perform the binding

Binding our Cloud Foundry Account

For this example we’re going to create a new HDI Container

If our Cloud Foundry space has more than one HANA Cloud instance, we may want to disable the default selection and manually choose the HANA Cloud instance where our container will reside

Creating our HDI Container

Now that we have our HDI Container and SAP HANA Project set up, it’s time to create our design-time objects. First, we login to Cloud Foundry

Click on View, then Find Command or press Ctrl+Shift+P
Search and select CF: Login to Cloud Foundry, then follow the instructions before selecting the Space with our HANA Cloud instance

Next, we’ll create our DocStore Collection

Use Find Command again to find Create SAP HANA Database Artifact, then click on it
Ensure that the artifact type is Document Store Collection, name is aribaRequisition and that the artifact will be created within the src folder of a HANA Project, then click on Create
Finally, we want to find our SAP HANA Project on the Explorer on the left, and click on the rocket icon to Deploy

After the deployment is successful, we have both our design-time .hdbcollection artifact, as well as the runtime DocStore collection which has been created in our HDI Container

Creating our Connections in Data Intelligence

So far we’ve gained access to Ariba APIs and enabled the Document Store in our HANA Cloud Instance. Next, we’ll be setting up two Connections in Data Intelligence Cloud

The first Connection will allow our Data Intelligence Pipeline to query the Ariba APIs to retrieve our data, and the second will allow us to store this data in the Document Store within our HDI Container

First, we use the DI Connection Manager to create a new Connection, selecting OPENAPI as the Connection Type

Create a new Connection in the Connection Manager

Our OpenAPI Connection will be used to send the request to Ariba. We’re going to set the connection up as below, using the credentials we received when we created our Ariba Application

Using our Ariba Application OAuth Credentials to create the OpenAPI Connection

Next, we’re going to create a HANA Connection that will let us work with the HDI Container we created earlier. To get the credentials, we have to go to the BTP Cockpit

Select our HDI Container from the SAP BTP Cockpit
Click on View Credentials
Click on Form View

We’ll want to keep this window open as we create our HANA DB Connection, as it has the details we need. Within Data Intelligence Cloud, create a new connection of type HANA_DB and fill it out as below using the credentials

Enter the credentials to create our HDI Connection

While we have the credentials open, take note of the Schema name. We’ll need this to set up our pipeline

Pipeline Overview

The source code for our pipeline is provided as a JSON file. Copy the contents of this JSON to a new Graph within the Data Intelligence Modeler. If you're not familiar with how to do this, you can refer to the README

When the pipeline starts, a GET request is made to the Ariba API. If there are more records to be fetched, the pipeline will make further requests until it has all available data. To avoid breaching Ariba’s rate limiting, there is a delay of 20 seconds between each call
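To make that control flow concrete, here is a hedged JavaScript sketch of the request loop the pipeline implements. The function and field names (callAribaApi, records, pageToken) are purely illustrative, not the actual Ariba API contract; refer to the Ariba API documentation for the real pagination mechanism.

// Sketch of the fetch loop: request a page, collect the records, wait
// 20 seconds to respect Ariba's rate limits, and repeat while more data
// is indicated. "callAribaApi", "records" and "pageToken" are invented names.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchAllRecords(callAribaApi) {
  const allRecords = [];
  let pageToken = null;

  do {
    const page = await callAribaApi(pageToken); // one GET request to the API
    allRecords.push(...(page.records || []));
    pageToken = page.pageToken || null;

    if (pageToken) {
      await sleep(20000); // delay between calls to stay under the rate limit
    }
  } while (pageToken);

  return allRecords;
}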

Fetching data from Ariba

Once all of the records have been fetched, the Document Store Collection is truncated to remove outdated results, and the most up to date data is inserted into our collection

Updating records
  1. A copy of the data is stored as a flat file in the DI Data Lake as reference
  2. The HANA Document Store Collection is truncated, and Documents are added to the Collection one at a time
  3. Once all records have been added to the Collection, the Graph will be terminated after a configurable buffer time (1 minute by default)

Configuring our Pipeline

In order to run this pipeline, you will have to make some changes:

In the Format API Request Javascript Operator, you should set your own values for openapi.header_params.apiKey and openapi.query_params.realm

You can edit this code from within the Script View of the Format API Request Operator
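As a rough orientation (the operator shipped with the sample graph may look different), setting these two values inside a Generation 1 JavaScript operator typically amounts to writing them into the message attributes, along these lines:

$.setPortCallback("input", onInput);

// Minimal sketch only: place your own API key and realm into the OpenAPI
// request parameters of the outgoing message. The placeholder values must
// be replaced, and the real Format API Request operator may differ.
function onInput(ctx, s) {
    s.Attributes["openapi.header_params.apiKey"] = "<your-application-api-key>";
    s.Attributes["openapi.query_params.realm"] = "<your-realm>";
    $.output(s);
}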

If your Connection names are different to ARIBA_PROCUREMENT and ARIBA_HDI, then you will want to select those under Connection for the OpenAPI Client and SAP HANA Client respectively

Changing the Connection for the OpenAPI Client
Changing the Connection for the HANA Client

Check the Path values for the operators “Write Payload Flat File” and “Write Error Log”. This will be where the pipeline will write the Flat File and API Error Logs respectively. If you’d like them to save elsewhere, edit that here

Setting the log paths

Finally, we’ll want to set the Document Collection Schema name in the DocStoreComposer Operator. This is the Schema we noted earlier while setting up the Connections

View the Script for our DocStoreComposer Operator
Add the Schema to the DocStoreComposer Operator

Testing our Pipeline

Now we’re ready to test our pipeline. Click on Save, then Run

Testing our Pipeline

Once our pipeline has completed successfully, we’ll be able to see that our JSON Documents are stored within our Collection by checking in the Database Explorer. We can access this easily through Business Application Studio by clicking the icon next to our SAP HANA Project

Getting to the Database Explorer from Business Application Studio

We can see that our pipeline has been successful, and that 206 JSON Documents have been stored in our Collection

Our Collection contains 206 Documents

Wrap-Up

In this blog post we’ve walked through how we can use SAP Data Intelligence Cloud to extract data from SAP Ariba, before storing it in a collection in SAP HANA Cloud’s Document Store

SAP Data Intelligence – What's New in DI:2022/05

SAP Data Intelligence, cloud edition DI:2022/05 will soon be available.

Within this blog post, you will find updates on the latest enhancements in DI:2022/05. We want to share and describe the new functions and features of SAP Data Intelligence for the Q2 2022 release.

Overview

This section will give you a quick preview about the main developments in each topic area. All details will be described in the following sections for each individual topic area.

SAP Data Intelligence 2022/05

Metadata & Governance

In this topic area you will find all features dealing with discovering metadata, working with it, and data preparation functionality. You may occasionally find similar information repeated for newly supported systems; this is intentional, so that readers who only look into one area do not miss information, and because there may be additional details relevant to that specific topic area.

Validation Rules operator integration with Metadata Explorer

USE CASE DESCRIPTION:

  • Ability for a Modeler to build a pipeline graph that reuses trusted Metadata Explorer’s validation and quality rules
  • Execution of rule validation from pipeline and reuse rules within rule operator

BUSINESS VALUE – BENEFITS:

  • Validation and quality rules created and defined by a subject matter expert in Metadata Explorer’s rulebooks can be reused by a Modeler in pipeline
  • Ability to run rulebooks in a pipeline and send pass and failed records to respective targets
    • Allow subject matter expert to ‘fix’ failed records to improve quality of the data
  • Collaboration between data stewards / subject matter experts and modeler / developers
  • Quickly be able to use rules in a pipeline without having to create the rules from scratch

Public APIs for metadata exchange

USE CASE DESCRIPTION:

  • Ability to export Metadata Explorer information, including:
    • Lineage information of datasets, including relations with other datasets
    • Used transformations and computations
    • Schema information
    • Profiling data
    • User descriptions

BUSINESS VALUE – BENEFITS:

  • Ability to consume and use exported information in reporting tools for:
    • Analysis
    • Creating plot graphs to visualize lineage information based on organizational needs and requirements
    • Reuse descriptions and annotations

Add Rules – Add Publishing – Add Connectivity within Metadata Explorer

BUSINESS VALUE – BENEFITS:

  • Expanded functionality support for sources, with new additions in DI:2022/05

Connectivity & Integration

This topic area focuses mainly on all kinds of connection and integration capabilities which are used across the product – for example: in the Metadata Explorer or on operator level in the Pipeline Modeler.

Connectivity to Teradata

Creating a new connection of type “TERADATA” in the connection management that can be used in Metadata Explorer as well as a data source for extraction use cases in pipelines.

  • Supported version: 17.x
  • Support via SAP Cloud Connector

Supported qualities:

  • Metadata Explorer
    • browsing
    • show metadata
    • data preview (tables)
  • Data Extraction via Generation 2 Pipelines
    • Table Consumer
    • SQL Consumer
    • SQL Executor

Support of Google Cloud Storage (GCS) as target in Replication Flows

Creating a Replication Flow now allows you to write data in the form of files to GCS as a target, using the following properties:

  • Container (Target file root path)
  • Group Delta By (none, date, hour)
  • File Type (csv, parquet, json, json lines)
  • File compression (only for parquet)

For each replication flow, you can add one or several tasks to load the data into GCS and:

  • Perform filtering (optional)
  • Change column mapping (optional)
  • Set or change target name
  • Select load type on data set level

Support of HANA Data Lake (HDL) Files as target in Replication Flows

Creating a Replication Flow now allows you to write data in the form of files to HDL-Files as a target, using the following properties:

  • Container (Target file root path)
  • Group Delta By (none, date, hour)
  • File Type (csv, parquet, json, json lines)
  • File compression (only for parquet)

For each replication flow, you can add one or several tasks to load the data into HDL-Files and:

  • Perform filtering (optional)
  • Change column mapping (optional)
  • Set or change target name
  • Select load type on data set level

Support of JSON & JSON Lines as target file type in Replication Flows

When creating a Replication Flow and selecting a cloud object store as the target (AWS S3, ADL V2, HDL Files or GCS), you can now also select:

  • JSON and
  • JSON Lines

as file formats in addition to previously available csv and parquet file formats.

When choosing JSON as the file format, you can select between two different JSON formats:

  • Records
  • Values
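As a generic illustration of the difference between the two file types (the exact record layout written by a replication flow depends on the source and on the chosen JSON format), JSON wraps all records in a single document, while JSON Lines writes one standalone JSON document per line:

// Generic illustration only -- not the exact structure a replication flow writes.
const records = [
  { id: 1, name: "Alpha" },
  { id: 2, name: "Beta" },
];

// JSON file type: one document containing all records.
const asJson = JSON.stringify(records, null, 2);

// JSON Lines file type: one JSON document per line.
const asJsonLines = records.map((r) => JSON.stringify(r)).join("\n");

console.log(asJson);
console.log(asJsonLines);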

Mass Data Replication via Replication Flows

Pipeline Modelling

This topic area covers new operators and enhancements to existing operators, as well as improvements and new functionality in the Pipeline Modeler and the development of pipelines.

Migration graph for merging part files

USE CASE DESCRIPTION:

  • merge small part files generated by replication flows, including both initial and delta loads
  • Supported merge scenarios/file formats
    • CSV to CSV
    • Parquet to Parquet

BUSINESS VALUE – BENEFITS:

  • Achieve replication with configurable file size

Administration

This topic area includes all services that are provided by the system – like administration, user management or system management.

Encrypt data using Customer Managed Keys

USE CASE DESCRIPTION:

  • Integration of SAP Data Custodian Key Management Service and SAP Data Intelligence
    • Supported for new DI Cloud instances created in AWS where a SAP Data Custodian Key Management service instance is available
    • Feature can be enabled during the creation of a new DI instance
    • Option to provide an existing Data Custodian Key reference to be used in the new DI instance

BUSINESS VALUE – BENEFITS:

  • Increased flexibility to use own encryption keys

Intelligent Processing

This topic area includes all improvements, updates and way forward for Machine Learning in SAP Data Intelligence.

Standalone Jupyter Lab Notebook

USE CASE DESCRIPTION:

  • Use Jupyter Lab for:
    • EDA
    • Data Preprocessing
    • Data Manipulation

without a hard dependency on ML Scenario Manager.

BUSINESS VALUE – BENEFITS:

  • Jupyter Lab app has its own tile on the Launchpad
  • Enabling of its usage independently of ML Scenario Manager, without necessarily affecting any of the existing scenarios in MLSM
  • Ability to associate Jupyter Lab Notebooks with an existing ML Scenario

These are the new functions, features and enhancements in SAP Data Intelligence, cloud edition DI:2022/05 release.

SAP Data Services Code Review using SAP Data Intelligence

This blog demonstrates how we can leverage the power of SAP Data Intelligence for DevOps. It gives a glimpse of how to check whether the ATL code of SAP Data Services developed by a user follows an organization's naming standards.

Using a SAP Data Intelligence pipeline and the Metadata Explorer, we can proactively keep an eye on whether development standards are being followed in SAP Data Services code development. Every company should have and follow development standards; otherwise, the end result is a lot of confusion and fragile code.

For this blog we have taken an example: according to an organization's naming standards, datastore names should start with DST*, e.g. DST_SAP or DST_LOOKUP. The same approach can also be applied to tables, job names, project names and workflow names; for example, staging tables should start with STG_EXT*. We could also check whether naming standards are being followed by directly accessing the metadata tables, but that approach is not considered in this blog because those tables are not accessible to everyone. We could also extract and deploy the SAP Data Services code to call AL_engine from a DI pipeline, but that is out of scope for this blog.
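The check itself is a simple prefix test. As a language-neutral illustration (the pipeline in this blog implements the extraction with a Python operator and the check with Metadata Explorer rules), it could look like this:

// Illustrative sketch of the naming rule: datastore names must start with DST_.
const namingRule = /^DST_/;

function checkDatastoreNames(datastoreNames) {
  return datastoreNames.map((name) => ({
    name: name,
    compliant: namingRule.test(name),
  }));
}

// Example: DS_STG_MGMT_LKP fails the rule until it is renamed.
console.log(checkDatastoreNames(["DST_SAP", "DST_LOOKUP", "DS_STG_MGMT_LKP"]));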

There are two prerequisites:

ATL Code of SAP Data Services
Naming Standard Document Followed by an Organization.

Overview of the SAP Data Intelligence pipeline that processes the SAP Data Services ATL file to check whether development standards are being followed

Processing of SAP Data Services ATL file

The ATL file is processed in the Data Intelligence pipeline to extract the datastore information. In the pipeline below, we read the input ATL file from the WASB location, process it using a Python script, and extract the datastore information. After extracting the datastore information, we load it into a HANA database.

SAP Data Intelligence Pipeline processing SAP Data Services ATL file

We have set up the following configurations for the different operators in the above pipeline:

Source Connection to WASB (Windows Azure Storage BLOB) for storing ATL files of SAP BODS
Read File Operator Configuration

After that, we have included two converters: ToBlob and then ToString.

Python Operator Configuration
Python Operator Script

We can enhance this further as needed, for example by passing in the string to search for dynamically.

After the Python operator, we have included a ToMessage converter.

HANA Operator Configurations (several screenshots)
Output After Processing the ATL File, Fetched Datastore Names

In the same way, we have loaded the Data Services naming convention standard into a table

SAP Data Intelligence Pipeline Processing Naming Standard Document of an Organization
Read File Operator Configuration
HANA Operator Configurations (several screenshots)
Output of Data Services Naming Convention store in a table

To continuously check whether the standards are being followed, we can define rules and report the failed data to the respective users via email or dashboard analytics

Rules Dashboard
Rules Dashboard continued: depicting the trend and keeping an eye on whether the development standard is being followed
Definition of the Rule

Output of the Rule, showcasing the Datastore names that do not follow the naming standards

The following datastores are not following the naming standards. We ran the pipeline multiple times; after the second-to-last run we asked the user to correct the name of the datastore, the user corrected it, and in the last run DS_STG_MGMT_LKP moved from the failed rows to the passed rows

Output showing failed rows of Initial run
Output of Last run after the correction of Datastore name by the Developer

So, 2 out of 4 datastores still do not comply with the organization's SAP Data Services naming standards.

We can follow up with the user and inform them either by mail, configured in the SAP DI pipeline using the Send Email operator, or via a dashboard to which the user is given access.


The post SAP Data Services Code Review using SAP Data Intelligence appeared first on ERP Q&A.

]]>
Introduction Data Intelligence Generation 2 Operators https://www.erpqna.com/introduction-data-intelligence-generation-2-operators/?utm_source=rss&utm_medium=rss&utm_campaign=introduction-data-intelligence-generation-2-operators Thu, 03 Feb 2022 11:20:41 +0000 https://www.erpqna.com/?p=59622 Introduction With SAP Data Intelligence release 2110 a new set of operators have been introduced: The generation 2 operators. It is more a leap than a next step that includes not only new features but the fundamental design has also changed. This is a first release of these new type of operators therefore not all […]

The post Introduction Data Intelligence Generation 2 Operators appeared first on ERP Q&A.

]]>
Introduction

With SAP Data Intelligence release 2110 a new set of operators has been introduced: the Generation 2 operators. This is more a leap than a next step, as it includes not only new features but also a fundamentally changed design. Since this is the first release of the new operator type, not all properties of the Generation 1 operators have a counterpart yet, and not all operators are available; the next releases will close the gap.

The new features covered in this blog

  1. Strict data type definitions -> increased code quality
  2. Changes in how new operators are created
  3. Streaming support of operators
  4. Pipeline resilience – saving and retrieving operator states
  5. Cross-engine data exchange -> far more flexibility, e.g. custom Python operators connect with Structured Data Operators

For a while the Generation 1 and Generation 2 operators will co-exist, but you cannot mix them in one pipeline. When creating a pipeline, you have to choose.

Strict Data Types

When building a custom operator you first define the data types that the operator should receive and send. In addition, you define the header data type; a header type is not mandatory, but it is good practice. There is a way to define data types at runtime or to infer them, but that should rather be the exception.

There are 3 types of data types:

  • Scalars – basic data types – e.g. integer, float, string,…
  • Structures – 1 dimensional list of Scalars – e.g. com.sap.error [code:int,operatorID:string, …] or embedded Tables (e.g. com.sap.error)
  • Tables – 2 dimensional list of Scalars

Hierarchical structures like the ones you can build with dictionaries/JSON are not supported. Keep in mind that these data types must be usable by all operators independent of the code language, service or engine. For the time being, such structures need to be flattened.
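As a small illustration (all names invented), a nested payload would have to be turned into flat key/value pairs before it can be mapped to a structure or table vtype:

# Illustration only: flatten a nested dict so it fits a flat structure/table vtype
nested = {"device": {"id": 42, "location": {"lat": 48.1, "lon": 11.6}}}

def flatten(d, prefix=""):
    flat = {}
    for key, value in d.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

print(flatten(nested))
# {'device_id': 42, 'device_location_lat': 48.1, 'device_location_lon': 11.6}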

For my project of testing the new Data Intelligence Monitoring PromQL API, I am going to use a custom Python operator that periodically generates big data tables with the following columns:

  1. id – com.sap.core.int64
  2. timestamp – com.sap.core.timestamp
  3. string_value – com.sap.core.string
  4. float_value – com.sap.core.float64

Whereas the basic data types integer, float and string are common across most languages, timestamp, for example, has different implementations. This means you might have to cast specific types before you use them in a vtype data structure. We have voted against automatic casting for now due to performance reasons.

The following table tells you how the vtype-scalars are implemented:

Type template | vflow base type | Value restrictions and format
bool          | byte (P6), bool (P7)   | enum [0, 1] (P6); enum [false, true] (P7)
uint8         | byte (P6), uint64 (P7) | range [0; 2^8-1]
int8          | int64   | range [-2^7; 2^7-1]
int16         | int64   | range [-2^15; 2^15-1]
int32         | int64   | range [-2^31; 2^31-1]
int64         | int64   | range [-2^63; 2^63-1]
uint64        | uint64  | range [0; 2^64-1]
float32       | float64 | IEEE-754 32bit single-precision floating-point number; stored in base type float64 in order not to introduce an additional vflow base type; approx. range [-3.4E38; 3.4E38]
float64       | float64 | IEEE-754 64bit double-precision floating-point number; approx. range [-1.7E308; 1.7E308]
decimal       | string  | Regex ^[-+]?[0-9]*\.?[0-9]+$; derived type must specify precision and scale
decfloat16    | string  | IEEE-754 64bit decimal floating-point number (decimal64) encoded as string; Regex ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$; approx. range [-9.9E384; 9.9E384]
decfloat34    | string  | IEEE-754 128bit decimal floating-point number (decimal128) encoded as string; Regex ^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$; approx. range [-9.9E6144; 9.9E6144]
date          | string  | ISO format YYYY-MM-DD; range 0001-01-01 to 9999-12-31
time          | string  | ISO format hh:mm:ss.fffffffff; up to 9 fractional seconds (1ns precision); values with fewer or no fractional seconds are allowed, e.g. 23:59:59 or 23:59:59.12; range [00:00:00.000 to 23:59:59.999999999]
timestamp     | string  | ISO format YYYY-MM-DDThh:mm:ss.fffffffff; up to 9 fractional seconds (1ns precision); values with fewer or no fractional seconds are allowed, e.g. 2019-01-01T23:59:59 or 2010-01-01T23:59:59.123456; range [0001-01-01T00:00:00.000000000 to 9999-12-31T23:59:59.999999999]
string        | string  | String encoding is UTF-8; derived type can specify a length property (number of Unicode characters)
binary        | blob    | Derived type can specify a length property (number of bytes)
geometry      | blob    | Geometry data in WKB (well-known binary) format
geometryewkb  | blob    | Geometry data in EWKB (extended well-known binary) format

As a header data structure, only Structures can be used. For my data generator operator I want to store the configuration parameters and the number of data batches that have already been generated:
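Since the screenshot of the header definition is not reproduced here, a small sketch of the header payload this operator publishes (field order index, max_index, num_rows, periodicity, taken from the script further below; the values are examples):

# Structure vtype payload: one key (the vtype id) mapping to the values in definition order
header = {"diadmin.headers.utils.performance_test": [0, 10, 1000, 5.0]}
#          index=0, max_index=10, num_rows=1000, periodicity=5.0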

Create Generation 2 Operator

The framework of Generation 2 operators is quite similar to that of Generation 1 operators, with a few exceptions.

Still the same:

  1. Operator creation wizard is still the same
  2. The creation framework is still the same
    1. ports
    2. tags
    3. configuration
    4. script
    5. documentation
  3. Script framework
    1. function calls hooked to port(s) with a callback function
    2. api.config access
    3. data sending functions

Differences, in addition to function name changes:

  1. messages have ids
  2. a response can be sent to the previous operator (not covered in this blog)
  3. with api.OperatorException you can hook into the standard operator exception flow
  4. functions supporting batch and stream data (not covered in this blog)
  5. additional functions to save and load the operator state (resilience)
  6. no generator function "api.add_generator" yet; only "api.add_timer" or "api.set_prestart"

You can find additional information in the documentation of the Generation 2 Python operator.

When you want to add a new custom Python operator, choose the "Python3 Operator (Generation 2)" option.

Sending Data Operator

To be explicit: for my "Big Data Generator" operator I need no inport, because I am going to use a timer that regularly produces data tables, and 3 outports:

  1. output of data type table using the previously defined table data type diadmin.utils.performance_test
  2. log of data type scalar com.sap.core.string for replacing the debug functionality
  3. stop of data type scalar com.sap.core.bool for stopping the pipeline

For the configuration I add the following parameters:

  1. num_rows (integer) for the number of records
  2. periodicity (number) for the seconds to generate a new data table
  3. max_index (integer) for the number of iterations before the pipeline should stop
  4. snapshot_time (integer), the idle time before the forced exception is raised
  5. crash_index (integer), the index at which the forced exception should be raised to demonstrate the resilience.

Finally we come to the scripting.

First I need to generate data consisting of an integer "id", a Python datetime "timestamp", and random float and string columns:

def create_test_df(num_rows, str_len=5):
    # Generates a DataFrame with the 4 columns of the vtype diadmin.utils.performance_test
    alphabet = list(string.ascii_lowercase)
    df = pd.DataFrame(index=range(num_rows), columns=['id', 'timestamp', 'string_value', 'float_value'])
    df['id'] = df.index
    df['timestamp'] = datetime.now(timezone.utc)
    # random lowercase string of length str_len per row
    df['string_value'] = df['string_value'].apply(lambda x: ''.join(np.random.choice(alphabet, size=str_len)))
    # random float between 0 and 1000
    df['float_value'] = np.random.uniform(low=0., high=1000, size=(num_rows,))
    return df

The callback function looks like:

from datetime import datetime, timezone
import string
import pickle
import time
import pandas as pd
import numpy as np

def gen():

    global index, data_df, crashed_already

    api.logger.info(f'Create new DataFrame: {index}')
    # generate num_rows records; the batch number (index) goes into the header
    data_df = create_test_df(api.config.num_rows)

    # Create Header
    header_values = [index, api.config.max_index, api.config.num_rows, float(api.config.periodicity)]
    header = {"diadmin.headers.utils.performance_test": header_values}
    header_dict = dict(zip(['index', 'max_index', 'num_rows', 'periodicity'], header_values))
    api.logger.info(f'Header: {header_dict}')

    # Check if graph should terminate
    if index >= api.config.max_index:  # stops if it is one step beyond isLast
        api.logger.info(f'Send msg to port \'stop\': {index}/{api.config.max_index}')
        api.outputs.stop.publish(True, header=header)
        return 0

    # forced exception (raised only once to demonstrate resilience)
    if index == api.config.crash_index and (not crashed_already):
        api.logger.info(f"Forced Exception: {index} - Sleep before crash: {api.config.snapshot_time}  - Crashed already: {crashed_already}")
        crashed_already = True
        time.sleep(api.config.snapshot_time)
        raise ValueError(f"Forced Crash: {crashed_already}")

    # convert df to table including data type cast; column order must match the vtype definition
    data_df['timestamp'] = data_df['timestamp'].apply(pd.Timestamp.isoformat)
    data_df = data_df[['id', 'timestamp', 'string_value', 'float_value']]
    tbl = api.Table(data_df.values.tolist(), "diadmin.utils.performance_test")

    # output port
    api.outputs.output.publish(tbl, header=header)

    # log port
    api.outputs.log.publish(f"{index} - {data_df.head(3)}")

    index += 1

    return api.config.periodicity

api.add_timer(gen)

Two comments regarding the changes compared to Generation 1 scripts:

Outports

With Gen2 the outport name is now an attribute of the class "outputs" (api.outputs.output, api.outputs.log and api.outputs.stop), and each outport has a publish method that takes 3 arguments: api.outputs.<port_name>.publish(data, header=None, response_callback=None), where only the data argument is mandatory.
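For illustration, this is roughly how the three outports of the data generator are served in the script above:

api.outputs.output.publish(tbl, header=header)   # table vtype plus structure header
api.outputs.log.publish("some log text")         # scalar port, data argument only
api.outputs.stop.publish(True, header=header)    # bool scalar that terminates the graph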

Outport Data Format

Data sent to an outport must comply with the data type of that outport. That means:

  • scalars: a plain data value according to the vtype representation (see the table above)
  • structures: a 1-dimensional list of values; for headers using a "structure" data type you pass a dictionary with the vtype id as key and the list of values in the defined sequence
  • tables: the api.Table constructor with the vtype id and a 2-dimensional list of values

In this case we use a structure vtype for the header (metadata) and a table vtype for the actual data.

A structure consists of a dictionary with one key, the vtype id (e.g. "diadmin.headers.utils.performance_test"), and an array of values in the same order as defined in the vtype definition.

header = {"diadmin.headers.utils.performance_test":[index,api.config.max_index,api.config.num_rows,float(api.config.periodicity)]}

A genuine dictionary might be more convenient, but that is the current approach. In addition, there is currently no way to retrieve the vtype definition (names and data types) from the registry; this might be added in a future release.

For the table-vtypes we have some more support. To construct a table you need to pass the vtype-id and a 2-dimensional array for the data, e.g.

data_df['timestamp'] = data_df['timestamp'].apply(pd.Timestamp.isoformat)
data_df = data_df[['id','timestamp','string_value','float_value']]
tbl = api.Table(data_df.values.tolist(),"diadmin.utils.performance_test")

Again, you have to check that the order of the columns corresponds to the vtype definition and that the data types are among the supported ones. In the case of the datetime column, I convert it to a string in ISO format.

With this we have all the pieces for the first Generation 2 operator. Be aware that for testing this operator you have to connect operators to all ports to which you send data, or comment out the corresponding "publish" calls.

Receiving Data Operator

As a counterpart to the previously described "sending" operator, we create a simple "receiving" operator with the same inport data type. Because the operator does nothing but provide food for the Python garbage collector, we call it "Device Null".

The script does nothing more than unpack the header information and the data and construct a pandas DataFrame.

import pandas as pd

def on_input(msg_id, header, data):
    # map the header values to a dict using the field order of the header vtype
    header_dict = dict(zip(['index', 'max_index', 'num_rows', 'periodicity'], list(header.values())[0]))
    # load the table from the underlying data stream
    tbl = data.get()
    api.outputs.log.publish(f"Batch: {header_dict['index']} Num Records: {len(tbl)}", header=header)
    # look up the vtype definition to get the column names
    tbl_info = api.type_context.get_vtype(tbl.type_ref)
    col_names = list(tbl_info.columns.keys())
    df = pd.DataFrame(tbl.body, columns=col_names)

api.set_port_callback("input", on_input)

callback function “on_input”

The callback function of an inport has three arguments:

  1. msg_id: the id identifying the message
  2. header: the metadata information, passed as a vtype "structure"
  3. data: the payload, with one of the types scalar, structure or table

Access to the Header

Unsurprisingly, the header data is accessed the same way it was created. The header is a dictionary with an array of values; e.g. the first value, "index", can be retrieved by

header['diadmin.headers.utils.performance_test'][0]

or, again, you map the data to a dictionary that reflects the vtype definition of the header:

header_dict = dict(zip(['index','max_index','num_rows','periodicity'],list(header.values())[0]))

Hope this is not too cryptic. Sometimes I cannot deny my strong C and Perl heritage.

As said previously, it would be a bit more elegant if an API method provided this data structure in the first place.

Access Data of a Table

To get the data of a Table you have to call "data.get()", which connects to the underlying data stream and loads it. The data itself – the 2-dimensional list – is stored in the attribute body. In order to build a DataFrame out of a table, you need to accomplish the following steps:

  1. tbl = data.get() – for creating a table instance from the data-message
  2. tbl_info = api.type_context.get_vtype(tbl.type_ref) – for getting the data-type reference
  3. col_names = list(tbl_info.columns.keys()) – for getting the column names via the vtype-reference stored in the table instance
  4. df = pd.DataFrame(tbl.body, columns = col_names) – build the DataFrame

As an advanced option you can apply a dtype conversion by using the tbl_info.columns dictionary, which has the column names as keys and gives you the components/scalars of the table. These you can map to pandas data types.
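A sketch of such a conversion, assuming a hand-maintained mapping from vtype scalars to pandas dtypes; the attribute used to read the scalar id of a column (type_ref) is an assumption, and the mapping is deliberately incomplete:

import pandas as pd

# Illustrative mapping from vtype scalar ids to pandas dtypes; extend as needed
VTYPE_TO_PANDAS = {
    "com.sap.core.int64": "int64",
    "com.sap.core.float64": "float64",
    "com.sap.core.string": "string",
}

def table_to_typed_df(tbl, type_context):
    # build the plain DataFrame as before
    tbl_info = type_context.get_vtype(tbl.type_ref)
    col_names = list(tbl_info.columns.keys())
    df = pd.DataFrame(tbl.body, columns=col_names)
    # then cast column by column where a mapping is known (timestamps stay ISO strings here)
    for col, component in tbl_info.columns.items():
        dtype = VTYPE_TO_PANDAS.get(getattr(component, "type_ref", None))
        if dtype:
            df[col] = df[col].astype(dtype)
    return df

# usage inside the callback: df = table_to_typed_df(data.get(), api.type_context)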

A standard conversion to DataFrame is planned that will relieve you from the above steps, but it will of course come with a performance price and might not exactly match your kind of data. For a standard conversion, not only the data types but also the NaN values have to be processed.

Streaming-Support

Sometimes, in particular for big data, it makes sense to convey the data in batches rather than in one chunk. For this you can create a writer and a reader of a stream and then read only the pieces you want to digest. You can mix both kinds of operators, e.g. send all the data at once and let the next operator read it in batches. This is what we are going to do here by changing the script of the receiving operator:

import pandas as pd


def on_input(msg_id, header, data):

    header_dict = dict(zip(['index', 'max_index', 'num_rows', 'periodicity'], list(header.values())[0]))

    # create a reader on the data stream and read the first batch of 2 records
    table_reader = data.get_reader()
    tbl = table_reader.read(2)
    tbl_info = api.type_context.get_vtype(tbl.type_ref)
    col_names = list(tbl_info.columns.keys())
    df = pd.DataFrame(tbl.body, columns=col_names)
    while len(tbl) > 0:
        api.outputs.log.publish(f"Stream: {header_dict['index']} Num Records: {len(tbl)}", header=header)
        tbl = table_reader.read(2)
        dft = pd.DataFrame(tbl.body, columns=col_names)
        df = pd.concat([df, dft], ignore_index=True)   # DataFrame.append would discard the result

api.set_port_callback("input", on_input)

The essential difference is in the following lines:

table_reader = data.get_reader()
tbl = table_reader.read(2)

Instead of calling the "get" method, you first create a reader instance and then read the number of data records you like until nothing is left in the stream. If you pass -1, you get all data left in the stream. You do not have to care about the number of bytes; the given data structure is used to do the job automatically under the hood, similar to a 'readline' in file IO. This is definitely a very handy feature that I have missed in the past.
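For example, to drain whatever is left in the stream with a single call:

table_reader = data.get_reader()
tbl_rest = table_reader.read(-1)   # -1 returns all records still left in the stream
api.outputs.log.publish(f"Read {len(tbl_rest)} remaining records", header=header)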

Snapshot

For big data processing pipelines that might run for days, you always had to find a way to bookkeep the process status so that you could restart from the last processed step instead of from the very beginning, and you had to build this for every pipeline again. Now, with the snapshot feature, it is part of every Generation 2 operator. As you will see, the additional effort is quite reasonable, and you no longer have to reinvent it for each pipeline.

Marcel Oenning pointed out that, in general, resilience is mandatory whenever a pipeline restarts that has operators whose state is essential for producing the correct output. It is not only a matter of convenience when processing big data.

The mechanism is quite easy. When you start a resilient pipeline, you need to pass the interval in seconds at which the pipeline should take a snapshot of its state, and the number of restart attempts within a certain period:

In the operator you have to define two new functions:

  • serialize – for saving the data
  • restore – for loading the data

and pass these to the corresponding api-methods

  • api.set_serialize_callback(serialize)
  • api.set_restore_callback(restore)

There are 2 other methods to be added, but they will not be configured here:

  • api.set_epoch_complete_callback(complete_callback)
  • api.set_initial_snapshot_info(api.InitialProcessInfo(is_stateful=True))

In our example I want to save the state of the first operator, "Big Data Generator": that means the index (= the number of already generated data batches) and the generated DataFrame. In addition, I want to save the information whether the operator has already crashed, because I am going to infuse a toxic index that leads to a deliberately raised exception. With this flag I ensure that the pipeline crashes only once.

def serialize(epoch):
    api.logger.info(f"Serialize: {index}  - {epoch} - Crashed already:{crashed_already}")
    return pickle.dumps([index,crashed_already,data_df])

def restore(epoch, state_bytes):
    global index, data_df, crashed_already
    index, crashed_already, data_df = pickle.loads(state_bytes)
    api.logger.info(f"Restore: {index}  - {epoch} - Crashed already:{crashed_already}")

Of course, you have to ensure that all operator data that you want to save is kept in global variables.

index = 0
crashed_already = False
data_df = pd.DataFrame()
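Putting the pieces together, the callbacks are registered as listed above; the no-op epoch-complete callback below is just a placeholder of my own:

def complete_callback(epoch):
    # placeholder: nothing to clean up once an epoch is acknowledged
    api.logger.info(f"Epoch complete: {epoch}")

api.set_serialize_callback(serialize)
api.set_restore_callback(restore)
api.set_epoch_complete_callback(complete_callback)
api.set_initial_snapshot_info(api.InitialProcessInfo(is_stateful=True))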

This is the toxic code I add to the script:

    # forced exception
    if index == api.config.crash_index and (not crashed_already):
        crashed_already = True
        time.sleep(api.config.snapshot_time)
        raise ValueError(f"Forced Crash: {crashed_already}")

Before the exception is raised, the status "crashed_already" needs to be set, and then the system needs some time to take a snapshot. The "sleep" time has to be longer than the snapshot periodicity. This is obvious, but I needed to learn it through 30 minutes of trial and error.

By the way, the final pipeline looks like this:

For a shortcut you can download a solution of the vtypes, operators and pipeline from my private GitHub.

Cross-engine data exchange

Finally, with the Generation 2 operators you can connect all operators irrespective of the underlying subengine, as long as the data types match. In our case, if we want to save the generated data to a file or a database, we can use the Structured Data Operators with all their convenience of data preview, mapping, etc. I suppose this is what most of us are going to love.

In the next releases, all of the commonly used operators will also become available as Generation 2 operators, which will boost the productivity and robustness of pipelines.


The post Introduction Data Intelligence Generation 2 Operators appeared first on ERP Q&A.

]]>