NoSQL data (XML, JSON, Avro, COBOL Copybook…) made simple

 



Data file management is a topical issue, especially in the era of Big Data and NoSQL.

More and more data is stored in flat files or hierarchical file structures (JSON, XML, etc.), some of which, such as Avro, are specific to Big Data. These hierarchical structures are either available directly in the file system (Windows, Linux, HDFS, Amazon, Azure, etc.) or encapsulated in third-party technology (Elasticsearch, MongoDB, BigQuery, Snowflake, Teradata, etc.).

Discover the features of Stambia which allow you to speed up projects requiring the use of files and hierarchical structures.


 

XML, JSON, Avro… Why is it sometimes complicated?

Traditional ETL solutions adapted to simple formats

Traditional data integration solutions have very often been designed for data in tabular format.

Hierarchical data management is achievable but costly in terms of development time and sometimes ineffective in terms of performance.

It is not uncommon to waste a lot of time and energy in projects handling even simple hierarchical data (mainframe files, files described by COBOL Copybooks, NoSQL data like JSON or Avro, XML data from web services, etc.).

 

Data volume issues


Things often get complicated when the volume of data increases.

Two scenarios can arise:

  • Each file is large and requires significantly more machine power, especially memory, so that the file can be read without the ELT process failing.
  • The number of files is large and the parallelization mechanisms are not efficient. As a result, the time required to process a batch of files can become prohibitive or penalizing for operations teams.
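
As a general illustration of the first scenario (not a description of how Stambia works internally), the sketch below streams a large XML document record by record, so memory use depends on the size of one record rather than on the size of the file. The `order` element name and the sample content are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def stream_records(source, tag):
    """Yield one dict per <tag> element without building the whole tree first."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the children of the record just processed

# Hypothetical sample standing in for a very large file on disk.
sample = io.BytesIO(
    b"<orders>"
    b"<order><id>1</id><city>Lyon</city></order>"
    b"<order><id>2</id><city>Paris</city></order>"
    b"</orders>"
)
records = list(stream_records(sample, "order"))
```

In a real job the records would be consumed one at a time instead of being collected into a list.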

 

 

XML, JSON ... Data formats that are not always easy to understand

Finally, not everyone is a specialist in hierarchical technologies, particularly Web Services or very specific formats like Avro or JSON.

Manipulating such structures with traditional or Open Source solutions can require significant technical skills and slow the response to business requests.

Every little peculiarity (an unsupported data type or an unusual file format) can waste a lot of time.


How does Stambia ETL manage hierarchical structures?

 

1. Simplifying the use of hierarchical data with Stambia ETL

Data representation by metadata

File management in Stambia is simple.

Many wizards help the user recover metadata. They are adapted to each technology, taking into account each specificity.

Where the technology allows it, Stambia offers to reverse-engineer the structure from standard descriptions (XSD, DTD, WSDL, etc.). When this is not possible (freer formats), the wizard proposes to use example data in order to recover as much information as possible.

At any time, the user can correct and add their own information in order to obtain a description of the objects that is as faithful as possible to the data to be processed.
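
The idea of recovering metadata from example data can be sketched in a few lines of Python. This is only an illustration of the principle, not Stambia's wizard: the `infer_schema` function and the sample document are hypothetical.

```python
import json

def infer_schema(value):
    """Return a nested description of the types found in a sample value."""
    if isinstance(value, dict):
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by its first element; an empty list stays unknown.
        return [infer_schema(value[0])] if value else ["unknown"]
    if value is None:
        return "unknown"  # the sample carries no type information here
    return type(value).__name__

sample = json.loads(
    '{"id": 7, "note": null,'
    ' "customer": {"name": "Ada", "vip": true},'
    ' "lines": [{"sku": "A1", "qty": 2.5}]}'
)
schema = infer_schema(sample)

# The sample could not tell what "note" contains, so the user completes it by hand.
schema["note"] = "str"
```

The last line mirrors the manual correction step: whatever the sample cannot reveal is filled in by the user.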


Exploring the hierarchical data

The Stambia Designer can directly read hierarchical data such as JSON or Avro with a specialized editor.

For other hierarchical files, the unequal connector, which is a JDBC driver, can read or write delimited or fixed-length files and can handle events within the same line (variable length of the same line type).

Once the file has been described, simple SQL commands can be used to read it as if it were composed of several tables (see image opposite).
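
To make the "file seen as several tables" idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module in place of a JDBC driver. The table names, column names and sample document are assumptions made for the example; the Stambia connector exposes files in its own way.

```python
import json
import sqlite3

# Hypothetical hierarchical document: orders, each with nested lines.
doc = json.loads("""
{"orders": [
  {"id": 1, "customer": "Ada", "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
  {"id": 2, "customer": "Bob", "lines": [{"sku": "A1", "qty": 5}]}
]}
""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")

# Each level of the hierarchy becomes one table; the parent id links them.
for order in doc["orders"]:
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
    conn.executemany(
        "INSERT INTO order_lines VALUES (?, ?, ?)",
        [(order["id"], line["sku"], line["qty"]) for line in order["lines"]],
    )

# A plain SQL join now reads the hierarchical file.
rows = conn.execute(
    "SELECT o.customer, SUM(l.qty) FROM orders o "
    "JOIN order_lines l ON l.order_id = o.id "
    "GROUP BY o.id ORDER BY o.id"
).fetchall()
```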

 

2. Reading and writing hierarchical data efficiently

Reading / loading hierarchical data with Stambia

Stambia's universal mapping provides the best way to read complex files in order to load multiple targets, with a very high level of performance.

The "multi-target" approach keeps the mapping simple while making it possible to load multiple targets from a single file. The file is read only once, and the data is sent at the same time (or in the requested sequence) to several targets.

This provides a high level of performance for reading, using and transforming hierarchical source data.

This approach is also data-oriented. It allows the user to focus on the link between his data and not on the technical process which is necessary for the realization of the mapping.
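
The single-pass, multi-target principle can be sketched as follows. This is a generic illustration rather than Stambia's mapping engine: the `load_multi_target` function, the target lists and the sample records are all hypothetical.

```python
import json

def load_multi_target(lines, targets):
    """Parse each source record once and send the extracted rows to every target."""
    for line in lines:
        record = json.loads(line)
        for rows, extract in targets:
            rows.extend(extract(record))

# Two in-memory lists stand in for two target tables.
customers, order_lines = [], []
targets = [
    (customers, lambda r: [(r["customer"]["id"], r["customer"]["name"])]),
    (order_lines, lambda r: [(r["id"], l["sku"]) for l in r["lines"]]),
]

source = [
    '{"id": 1, "customer": {"id": 10, "name": "Ada"}, "lines": [{"sku": "A1"}, {"sku": "B2"}]}',
    '{"id": 2, "customer": {"id": 11, "name": "Bob"}, "lines": [{"sku": "C3"}]}',
]
load_multi_target(source, targets)
```

Each extraction rule only states which part of the record feeds which target, which is the data-oriented view described above; the loop that reads the source is written once.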

Writing hierarchical data with Stambia

Stambia can integrate the data in a hierarchical file or structure in a single mapping, regardless of the complexity of the structure addressed.

In the example provided, the target of the mapping is an XML file composed of several hierarchies.

While remaining legible, a single mapping will produce a single hierarchical file, which may contain very deep hierarchies, or multiple occurrences of the same elements, or even juxtaposed hierarchies.

As with reading, this approach is data-oriented (data-centric). It allows the user to focus on the links between the data rather than on the technical process needed to build the mapping.

This approach is also very useful when using Web Services or APIs which use this type of hierarchical data for inputs or outputs.
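
As a rough illustration of producing one hierarchical file from flat data, the sketch below groups tabular rows into a nested XML document with Python's standard library. The element names, attribute names and rows are hypothetical and do not come from the example mentioned above.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Flat source rows: (order_id, customer, sku, qty), already sorted by order_id.
rows = [
    (1, "Ada", "A1", 2),
    (1, "Ada", "B2", 1),
    (2, "Bob", "A1", 5),
]

root = ET.Element("orders")
for order_id, group in groupby(rows, key=lambda r: r[0]):
    group = list(group)
    order = ET.SubElement(root, "order", id=str(order_id))
    ET.SubElement(order, "customer").text = group[0][1]
    # Repeated child elements: one <line> per source row of this order.
    for _, _, sku, qty in group:
        ET.SubElement(order, "line", sku=sku, qty=str(qty))

xml_text = ET.tostring(root, encoding="unicode")
```

`groupby` only merges consecutive rows, so the source must be sorted on the grouping key first.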

 

3. Automating and industrializing with Stambia ETL

Industrializing the reading or writing of files in directories

Manipulating files sometimes requires iterating the same operations on several identical files.

For example, reading several identical source files in a loop: several orders or batches of orders to integrate into an ERP or CRM. Or generating several files from the same source dataset: one file per city or per supplier.

These operations can be complex with traditional solutions.

Stambia offers many features that automate these processes, notably the ability to handle the directory or file level within a mapping, so that hierarchical structures can be read or written in batches without additional technical processes.

This approach makes it possible to keep a business-oriented (data-centric) vision of developments, and above all, to guarantee optimal performance during batch processing of large files.
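
Outside of any particular tool, the two patterns mentioned above (one output file per city, then a loop over every file in a directory) look roughly like this in Python. The function names, the CSV layout and the sample rows are assumptions for the sake of the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def write_one_file_per_key(rows, key, out_dir):
    """Group rows by `key` and write one CSV file per distinct value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, members in groups.items():
        with open(out_dir / f"{value}.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(members[0]))
            writer.writeheader()
            writer.writerows(members)

def read_all_files(in_dir, pattern="*.csv"):
    """Iterate over every matching file and yield (file name, row) pairs."""
    for path in sorted(in_dir.glob(pattern)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                yield path.name, row

rows = [
    {"city": "Lyon", "supplier": "S1", "amount": "10"},
    {"city": "Paris", "supplier": "S2", "amount": "20"},
    {"city": "Lyon", "supplier": "S3", "amount": "30"},
]
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    write_one_file_per_key(rows, "city", folder)
    loaded = list(read_all_files(folder))
```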

Automatically replicating hierarchical data (XML, JSON, Avro…)

 
 

Integration of source files into a target can also be automated using the replication component.

This component allows you to browse a directory and massively integrate files into a relational database or any other structured target.

In this case, there is no mapping or development. The replicator is able to create a relational structure (or other) from a hierarchical file structure and to complete the database with files that have been found in the folder.

This type of component can incorporate incremental integration mechanisms (with calculation of the differences) to integrate data in a target coherently and without duplicates.
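
One simple way to picture incremental integration with difference calculation is to compare each incoming record with the target before writing it. The sketch below does this against an in-memory SQLite table; the `customers` table and the `integrate_incrementally` function are hypothetical, and Stambia's replicator may work quite differently.

```python
import sqlite3

def integrate_incrementally(conn, records):
    """Insert new records, update changed ones, and skip identical ones."""
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for rec_id, name in records:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (rec_id,)
        ).fetchone()
        if row is None:
            conn.execute("INSERT INTO customers (id, name) VALUES (?, ?)", (rec_id, name))
            counts["inserted"] += 1
        elif row[0] != name:
            conn.execute("UPDATE customers SET name = ? WHERE id = ?", (name, rec_id))
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    return counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Two successive runs, as if two batches of files had been found in the folder.
first = integrate_incrementally(conn, [(1, "Ada"), (2, "Bob")])
second = integrate_incrementally(conn, [(1, "Ada"), (2, "Robert"), (3, "Cy")])
```

Running the same batch twice leaves the table unchanged, which is what keeps the target free of duplicates.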


Technical Specifications

Specification | Description

Simple and agile architecture

  1. Designer: development environment
  2. Runtime: engine for executing data integration processes, Web services, ...
  3. Production Analytics: monitoring of executions and deployment to production

Protocol

HDFS, GCS, Azure Cloud

HTTP REST / SOAP

Data Format

XML, JSON, AVRO, and any specific format

ASCII, EBCDIC, packed decimals, Parquet, ...

Connectivity

You can extract or integrate data from:

  • Any relational database system like Oracle, PostgreSQL, Microsoft SQL Server (MSSQL), MariaDB, ...
  • Any NoSQL database system like MongoDB, Elasticsearch, Cassandra, HBase, ...
  • Any high performance database system like Netezza, Vertica, Teradata, Actian Vector, Sybase IQ, ...
  • Any Cloud system like Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Snowflake, ...
  • Any ERP application like SAP, Microsoft Dynamics, ...
  • Any SaaS application like Salesforce, Snowflake, BigQuery, ...
  • Any Big Data system like Spark, Hadoop, Hive, Impala, ...
  • Any MOM or ESB messaging system like Apache ActiveMQ, Kafka, OpenJMS, Nirvana JMS, ...
  • Any file format like CSV, XML, JSON, ...
  • Any spreadsheet system like Excel, Google Spreadsheet, ...

For more information, consult our technical documentation

Technical Connectivity
  • FTP, SFTP, FTPS
  • Email (SMTP)
  • LDAP, OpenLDAP
  • Kerberos

Standard Characteristics

  • Reverse engineering: the structure of the database can be reverse-engineered into metadata
  • DDL / DML operations: supports operations on objects and data (Insert, Update, Select, Delete, Create, Drop)
  • Integration method: Append, Incremental Update
  • Staging: a database can be used as an intermediate step (staging area) for data transformation, reconciliation, etc.
    The supported modes are:
    • staging as subquery
    • staging as view
    • staging as table
  • Rejection: rejection rules can be defined to filter or detect data that does not meet the conditions defined during the integrations.
    • 3 types of rejection can be created: Fatal, Warning, Reject
    • Differentiated processing according to the type of data for each rejected data
    • Recycling of rejects created during previous runs
  • Replication: database replication is supported from any source such as relational or NoSQL databases, flat files, XML / JSON files, Cloud system, etc.
Advanced characteristics
  • Slowly Changing Dimension (SCD): integrations can apply Slowly Changing Dimension logic
  • Loading methods:
    • Generic load
    • COPY loader
  • Change Data Capture (CDC)
  • Privacy Protect: module for GDPR compliance, with the following features:
    • Anonymization
    • Pseudonymization
    • Audits
    • ...
  • Data Quality Management (DQM): data quality management directly integrated into metadata and in the Designer
Technical prerequisites
  • Operating system:
    • Windows XP, Vista, 2008, 7, 8, 10 in 32- or 64-bit mode
    • Linux in 32- or 64-bit mode
    • Mac OS X in 64-bit mode
  • Memory
    • At least 1 GB of RAM
  • Disk space
    • At least 300 MB of available disk space
  • Java environment
    • JVM 1.8 or higher
  • Note: on Linux, a GTK+ 2.6.0 windowing system with all its dependencies is required

Cloud Deployment

Docker image available for the execution engines (Runtime) and the operations console (Production Analytics)

Supported Standards
  • Open API Specifications (OAS)
  • Swagger 2.0
  • W3C XML
  • WSI compliant
  • SQL

Scripting Language

Jython, Groovy, Rhino (JavaScript), ...

Source Manager

Any Eclipse-supported plugin: SVN, CVS, Git, ...
