NoSQL data (XML, JSON, Avro, COBOL copybook…) made simple

Data file management is a topical issue, especially in the era of Big Data and NoSQL.

More and more data is stored in flat files or hierarchical file structures (JSON, XML, etc.), some of which, such as Avro, are specific to Big Data. These hierarchical structures are either available directly in a file system or storage service (Windows, Linux, HDFS, Amazon, Azure, etc.) or encapsulated in a third-party technology (Elasticsearch, MongoDB, BigQuery, Snowflake, Teradata, etc.).

Discover the features of Stambia which allow you to speed up projects requiring the use of files and hierarchical structures.

Read and Write hierarchical structures and files XML, JSON, AVRO with Stambia ETL / ELT

XML, JSON, Avro… Why is it sometimes complicated?

Traditional ETL solutions adapted to simple formats

Traditional data integration solutions have very often been designed for data in tabular format.

Hierarchical data management is achievable but costly in development time and sometimes inefficient in terms of performance.

It is not uncommon for projects to waste a lot of time and energy handling even simple hierarchical data (mainframe files, files described by COBOL copybooks, NoSQL data such as JSON or Avro, XML data from web services, etc.).

Hierarchical data (JSON, Avro) in analytics

Data volume issues

Things often get complicated when the volume of data increases.

Two scenarios can arise:

Each file is large and requires significantly more machine resources, especially memory, so that it can be read without the ELT process failing.
The number of files is large and the parallelization mechanisms are inefficient. As a result, the time required to process a batch of files can become prohibitive or penalizing for operations teams.

XML, JSON ... Data formats that are not always easy to understand

Finally, not everyone is a specialist in hierarchical technologies, particularly Web Services or very specific formats like Avro or JSON.

Manipulating such structures with traditional or Open Source solutions can require significant technical skills and slow down the response to business requests.

Every little peculiarity (an unsupported data type or an unusual file format) can waste a lot of time.

How does Stambia ETL manage hierarchical structures?

1. Simplifying the use of hierarchical data with Stambia ETL

Data representation by metadata

File management in Stambia is simple.

Many wizards help the user recover metadata. They are adapted to each technology, taking into account each specificity.

Where the technology allows it, Stambia offers to use the relevant reverse-engineering standards (XSD, DTD, WSDL, etc.). When this is not possible (free-form formats), the wizard proposes to use example data in order to recover as much information as possible.

At any time, users can correct and add their own information so that the description of the objects is as faithful as possible to the data they will have to process.
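To illustrate the idea of recovering metadata from example data, here is a minimal Python sketch that derives a nested type description from a sample JSON document. It is not Stambia's wizard: the function name `infer_schema`, the type labels and the sample document are all hypothetical, and a real tool would examine many samples rather than one.

```python
import json


def infer_schema(value):
    """Return a nested description of the types found in a sample value."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Assumption: a list is described by its first element only.
        return [infer_schema(value[0])] if value else ["unknown"]
    if isinstance(value, bool):  # must be tested before int (bool is an int subclass)
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if value is None:
        return "null"
    return "string"


# Hypothetical sample document used only for this illustration.
sample = json.loads(
    '{"id": 1, "customer": {"name": "Ann"}, "lines": [{"sku": "A1", "qty": 2.5}]}'
)
schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

The inferred description is only a starting point, which is why being able to correct it by hand afterwards matters.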

Exploring the hierarchical data

The Stambia Designer can read hierarchical data such as JSON or Avro directly, using a specialized editor.

For other hierarchical files, the file connector, which is a JDBC driver, can read or write delimited or fixed-length files and can handle repeated occurrences within a single line (lines of the same type with a variable length).

Once the file has been described, simple SQL commands can be used to read it as if it were composed of several tables.
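The principle of querying a hierarchical file as several tables can be sketched outside Stambia with Python's standard library. The example below flattens a small JSON document into two in-memory SQLite tables and then joins them in SQL. The document, the table names `orders` and `order_lines`, and the query are assumptions made for illustration; Stambia's JDBC driver does this from its own metadata, not from this code.

```python
import json
import sqlite3

# Hypothetical hierarchical document: each order contains a list of lines.
document = json.loads("""
{"orders": [
  {"id": 1, "customer": "Ann", "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
  {"id": 2, "customer": "Bob", "lines": [{"sku": "A1", "qty": 5}]}
]}
""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")

# Each level of the hierarchy becomes one table; the parent id links them.
for order in document["orders"]:
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order["id"], order["customer"]))
    conn.executemany(
        "INSERT INTO order_lines VALUES (?, ?, ?)",
        [(order["id"], line["sku"], line["qty"]) for line in order["lines"]],
    )

# Plain SQL now works across the levels of the original hierarchy.
rows = conn.execute(
    "SELECT o.customer, SUM(l.qty) FROM orders o "
    "JOIN order_lines l ON l.order_id = o.id "
    "GROUP BY o.customer ORDER BY o.customer"
).fetchall()
print(rows)
```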

2. Reading and writing hierarchical data efficiently

Reading / loading hierarchical data with Stambia

Stambia's universal mapping provides the best way to read complex files in order to load multiple targets, with a very high level of performance.

The "multi-target" approach keeps the mapping simple while making it possible to load multiple targets from a single file. The file is read only once, and the data is sent to several targets at the same time (or in the requested sequence).

This provides a high level of performance for reading, using and transforming hierarchical source data.

This approach is also data-oriented. It lets users focus on the relationships between their data rather than on the technical process needed to implement the mapping.
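As a rough illustration of the single-read, multi-target idea, the following Python sketch walks an XML document once and feeds two targets in the same pass. The function `load_multi_target`, the sample XML and the use of in-memory lists as "targets" are hypothetical; it shows the general technique, not how a Stambia mapping is executed.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document.
SOURCE = """<orders>
  <order id="1"><customer>Ann</customer><line sku="A1" qty="2"/><line sku="B2" qty="1"/></order>
  <order id="2"><customer>Bob</customer><line sku="A1" qty="5"/></order>
</orders>"""


def load_multi_target(xml_text):
    """Parse the document once and fill two targets during the same traversal."""
    order_target, line_target = [], []
    for order in ET.fromstring(xml_text).iter("order"):
        order_id = int(order.get("id"))
        order_target.append((order_id, order.findtext("customer")))
        for line in order.iter("line"):
            line_target.append((order_id, line.get("sku"), int(line.get("qty"))))
    return order_target, line_target


orders, lines = load_multi_target(SOURCE)
print(orders)
print(lines)
```

Reading the source once avoids parsing the same large file again for each target, which is where much of the cost lies with hierarchical formats.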

Writing hierarchical data with Stambia

Stambia can integrate the data in a hierarchical file or structure in a single mapping, regardless of the complexity of the structure addressed.

In the example provided, the target of the mapping is an XML file composed of several hierarchies.

While remaining legible, a single mapping will produce a single hierarchical file, which may contain very deep hierarchies, or multiple occurrences of the same elements, or even juxtaposed hierarchies.

As with reading, this approach is data-oriented (data-centric). It lets users focus on the relationships between their data rather than on the technical process needed to implement the mapping.

This approach is also very useful when using Web Services or APIs which use this type of hierarchical data for inputs or outputs.
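The reverse direction, producing one hierarchical file from tabular rows, can be sketched in a few lines of Python. The element names (`cities`, `city`, `customer`, `line`), the sample rows and the function `build_xml` are assumptions chosen for the example and do not come from Stambia.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat rows: (city, customer, sku, quantity).
rows = [
    ("Paris", "Ann", "A1", 2),
    ("Paris", "Ann", "B2", 1),
    ("Lyon", "Bob", "A1", 5),
]


def build_xml(flat_rows):
    """Group flat rows into a city > customer > line hierarchy."""
    root = ET.Element("cities")
    cities, customers = {}, {}
    for city, customer, sku, qty in flat_rows:
        if city not in cities:
            cities[city] = ET.SubElement(root, "city", name=city)
        key = (city, customer)
        if key not in customers:
            customers[key] = ET.SubElement(cities[city], "customer", name=customer)
        # Repeated elements: one <line> per source row.
        ET.SubElement(customers[key], "line", sku=sku, qty=str(qty))
    return root


root = build_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```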

3. Automating and industrializing with Stambia ETL

Industrializing the reading or writing of files in directories

Manipulating files sometimes requires iterating the same operations on several identical files.

For example, reading several identical source files in a loop: several orders or batches of orders to integrate into an ERP or CRM. Or generating several files from the same source dataset: one file per city or per supplier.
These operations can be complex with traditional solutions.

Stambia offers many features that automate these processes, notably the ability to handle the directory or file level directly in a mapping, so that hierarchical structures can be read or written in batches without additional technical processes.

This approach makes it possible to keep a business-oriented (data-centric) vision of developments, and above all, to guarantee optimal performance during batch processing of large files.
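For comparison, here is what the two cases mentioned above (reading a batch of identical files, then generating one file per city) look like when written by hand in Python. The file names, field names and the helper functions `read_batch` and `write_per_city` are hypothetical; the sketch only shows the kind of plumbing that a directory-aware mapping is meant to replace.

```python
import csv
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def read_batch(source_dir):
    """Read every *.json file in a directory, in name order, as one record list."""
    records = []
    for path in sorted(Path(source_dir).glob("*.json")):
        records.extend(json.loads(path.read_text(encoding="utf-8")))
    return records


def write_per_city(records, target_dir):
    """Write one CSV file per city and return the created file names."""
    by_city = defaultdict(list)
    for record in records:
        by_city[record["city"]].append(record)
    for city, rows in by_city.items():
        target = Path(target_dir) / f"{city}.csv"
        with open(target, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=["order_id", "city"])
            writer.writeheader()
            writer.writerows(rows)
    return sorted(f"{city}.csv" for city in by_city)


# Demonstration in a temporary directory with two hypothetical batch files.
with tempfile.TemporaryDirectory() as workdir:
    source, target = Path(workdir) / "in", Path(workdir) / "out"
    source.mkdir()
    target.mkdir()
    (source / "batch1.json").write_text(
        '[{"order_id": 1, "city": "Paris"}]', encoding="utf-8"
    )
    (source / "batch2.json").write_text(
        '[{"order_id": 2, "city": "Lyon"}, {"order_id": 3, "city": "Paris"}]',
        encoding="utf-8",
    )
    created = write_per_city(read_batch(source), target)
    paris_rows = (target / "Paris.csv").read_text(encoding="utf-8").splitlines()

print(created)
print(paris_rows)
```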

Automatically replicating hierarchical data (XML, JSON, Avro…)

Integration of source files into a target can also be automated using the replication component.

This component allows you to browse a directory and massively integrate files into a relational database or any other structured target.

In this case, no mapping or development is needed. The replicator can create a relational (or other) structure from a hierarchical file structure and populate the database with the files found in the folder.

This type of component can incorporate incremental integration mechanisms (with calculation of the differences) to integrate data in a target coherently and without duplicates.
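The replication idea can be approximated with a short Python sketch: the target table is created from the fields found in the records, and each new file is merged on a key so that reloading does not create duplicates. This is a simplification under stated assumptions: the function `replicate`, the table and key names and the sample records are hypothetical, and SQLite's `INSERT OR REPLACE` is used in place of a true calculation of differences. It also assumes every file has the same fields as the first one.

```python
import json
import sqlite3


def replicate(conn, table, records, key):
    """Create the table from the records' fields if needed, then merge on the key."""
    columns = sorted({name for record in records for name in record})
    column_defs = ", ".join(
        f'"{name}" PRIMARY KEY' if name == key else f'"{name}"' for name in columns
    )
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    # Rows whose key already exists are replaced instead of duplicated.
    conn.executemany(
        f'INSERT OR REPLACE INTO "{table}" ({column_list}) VALUES ({placeholders})',
        [tuple(record.get(name) for name in columns) for record in records],
    )


conn = sqlite3.connect(":memory:")
# Two hypothetical files found in the folder, the second overlapping the first.
first_file = json.loads('[{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]')
second_file = json.loads('[{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]')
replicate(conn, "customers", first_file, key="id")
replicate(conn, "customers", second_file, key="id")

result = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(result)
```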

Technical Specifications

Specification	Description
Simple and agile architecture	Designer: development environment; Runtime: engine for executing data integration processes, Web services, ...; Production Analytics: monitoring of executions and deployment to production
Protocol	HDFS, GCS, Azure Cloud; HTTP REST / SOAP
Data Format	XML, JSON, Avro, and any specific format: ASCII, EBCDIC, packed decimal amounts, Parquet, ...
Connectivity	You can extract or integrate data from: any relational database system like Oracle, PostgreSQL, Microsoft SQL Server (MSSQL), MariaDB, ...; any NoSQL database system like MongoDB, Elasticsearch, Cassandra, HBase, ...; any high-performance database system like Netezza, Vertica, Teradata, Actian Vector, Sybase IQ, ...; any Cloud system like Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Snowflake, ...; any ERP application like SAP, Microsoft Dynamics, ...; any SaaS application like Salesforce, Snowflake, BigQuery, ...; any Big Data system like Spark, Hadoop, Hive, Impala, ...; any MOM or ESB messaging system like Apache ActiveMQ, Kafka, OpenJMS, Nirvana JMS, ...; any file format like CSV, XML, JSON, ...; any spreadsheet system like Excel, Google Spreadsheet, ... For more information, consult our technical documentation.
Technical Connectivity	FTP, SFTP, FTPS; Email (SMTP); LDAP, OpenLDAP; Kerberos
Standard Characteristics	Reverse: the structure of the database can be reverse-engineered into metadata. DDL / DML operations: supports operations for manipulating objects and data (Insert, Update, Select, Delete, Create or Drop). Integration methods: Append, Incremental Update. Staging: a database can be used as an intermediate step (staging area) for data transformation, reconciliation, etc.; the supported modes are staging as subquery, staging as view, and staging as table. Rejection: rejection rules can be defined to filter or detect data that does not meet the conditions defined during integration; three types of rejection can be created (Fatal, Warning, Reject), with differentiated processing according to the type of each rejected record and recycling of rejects created during previous runs. Replication: database replication is supported from any source such as relational or NoSQL databases, flat files, XML / JSON files, Cloud systems, etc.
Advanced characteristics	Slowly Changing Dimension (SCD): integrations can be performed using slowly changing dimensions; Loading methods: generic load, COPY loader; Change Data Capture (CDC); Privacy Protect: module for managing GDPR compliance, with anonymization, pseudonymization, audits, ...; Data Quality Management (DQM): data quality management integrated directly into the metadata and the Designer
Technical prerequisites	Operating system: Windows XP, Vista, 2008, 7, 8, 10 in 32- or 64-bit mode; Linux in 32- or 64-bit mode; Mac OS X in 64-bit mode. Memory: at least 1 GB of RAM. Disk space: at least 300 MB of available disk space. Java environment: JVM 1.8 or higher. Note: on Linux, a GTK+ 2.6.0 windowing system with all its dependencies is required.
Cloud Deployment	Docker image available for the execution engines (Runtime) and the operations console (Production Analytics)
Supported Standards	OpenAPI Specification (OAS); Swagger 2.0; W3C XML; WS-I compliant; SQL
Scripting Language	Jython, Groovy, Rhino (Javascript), ...
Source Manager	Any Eclipse-supported plugin : SVN, CVS, Git, ...
