TP3.2: Integrated Monitoring (Infrastructure, Services & Business)

The subproject Integrated Monitoring (TP 3.2) is concerned with operational and associated architectural aspects of mobility services platforms, in particular monitoring at different levels of abstraction. Driven by the adoption of agile software development processes and advances in Big Data and machine learning technologies, DevOps teams seek to optimize end-user experiences and internal processes. DevOps teams and related stakeholders can be supported by a holistic view of architectural and runtime behavior.

Figure 1: Platform Architecture Management
Providing a holistic view of complex platforms, IT landscapes or systems-of-systems is challenging, because architectural information such as hardware, software, business processes and their dependencies is often not explicitly documented. Another challenge is the diversity of specialized solutions for infrastructure monitoring, application monitoring, and business process monitoring (QoS) or mining, which complicates the selection of a particular monitoring solution.
Moreover, the inputs and outputs of monitoring solutions from different layers cover a wide range of data formats (structured, semi-structured and unstructured) and hence are difficult to combine. Furthermore, current monitoring solutions do not link the runtime data extracted from the individual abstraction layers. Most of these solutions specialize in a specific layer or requirement, such as Business Process Monitoring, Application Performance Monitoring, Infrastructure Monitoring, or Log Mining. This makes it challenging to obtain an integrated and holistic view of the behavior and status of the platform, as most tools support either a technical viewpoint or a business-oriented viewpoint, but not both.

A major goal of the research project is to 1) investigate and provide best practices for monitoring software services that are built on microservices and 2) develop a prototype for real-time multi-layer platform monitoring that establishes a link between the software and hardware components (their communication behavior and the dependencies between the components) and the business processes that are supported by the services. On top of such monitoring solutions, the project investigates vertical use cases such as root cause analysis, impact analysis and user transaction tracing, in which operational key performance indicators and events obtained from different abstraction levels (infrastructure, services, business and end-user) are combined with each other and with architectural knowledge.

Figure 3: Use Cases
In the context of the proposed multi-layer monitoring solution we establish an EA tool enhanced with a Configuration Management Database (CMDB). The tool integrates a metamodel manager that uses an EA model as the design-time model. This model defines visual representations of all monitored components, such as business capabilities, business processes, software systems and hardware elements, together with their relations. The meta-information of the components, their relations and monitoring-specific information such as the path to log files are stored in the CMDB.
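
To make the role of the CMDB more concrete, the following is a minimal in-memory sketch, not the project's actual data model. The class names (`Component`, `Cmdb`), the key format, the host and service names and the log path are all assumptions made for illustration. It stores components with their layer and log path, records "depends on" relations, and walks those relations upwards, which is the basic operation behind an impact analysis.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Component:
    key: str                        # identifier issued by the EA tool (hypothetical format)
    layer: str                      # "business", "application" or "infrastructure"
    name: str
    log_path: Optional[str] = None  # monitoring-specific meta-information


class Cmdb:
    """In-memory stand-in for the CMDB: components plus 'depends on' relations."""

    def __init__(self):
        self.components = {}
        self.dependents = defaultdict(set)  # key -> keys of components that depend on it

    def add(self, component):
        self.components[component.key] = component

    def relate(self, source_key, target_key):
        """Record that the source component depends on the target component."""
        self.dependents[target_key].add(source_key)

    def impacted_by(self, key):
        """Breadth-first walk up the relations: everything that depends on `key`."""
        seen, queue = set(), deque([key])
        while queue:
            for parent in self.dependents[queue.popleft()]:
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return {self.components[k] for k in seen}


# Hypothetical example data: one server, one service, one business process.
cmdb = Cmdb()
cmdb.add(Component("hw-1", "infrastructure", "app-server-01"))
cmdb.add(Component("svc-1", "application", "booking-service",
                   log_path="/var/log/booking/app.log"))
cmdb.add(Component("bp-1", "business", "Book a ride"))
cmdb.relate("svc-1", "hw-1")   # the service runs on the server
cmdb.relate("bp-1", "svc-1")   # the process is supported by the service

impacted = sorted(c.name for c in cmdb.impacted_by("hw-1"))
print(impacted)  # ['Book a ride', 'booking-service']
```

In this sketch a failure of the server is propagated to the service and, transitively, to the business process it supports. A real CMDB would persist this information and distinguish several relation types.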

Figure 2: Multi-level Monitoring

Monitor the Infrastructure Layer
The infrastructure layer of the platform is monitored by deploying agents on the servers that extract status information about traffic, bandwidth, CPU utilization, etc. Open source solutions like Nagios or Sensu are suitable for this purpose. However, these solutions do not provide information about which specific user transaction is responsible for high resource utilization or which application causes abnormal behavior. One approach to address this challenge is to leverage the linked information in the EA tool in order to correlate hardware metrics with running transactions, using the recorded timestamps as the join key. In addition, log events written by the infrastructure and the application layer can uncover further important information about the behavior of these systems. For this purpose, we deploy the ELK stack for processing and indexing the log files. The information about which IT component creates the log events and which user transaction is currently in progress can be retrieved from the EA tool, passed to the Logstash pipeline and stored in Elasticsearch.
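
The timestamp-based correlation described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the record types `MetricSample` and `Transaction`, the 90 % CPU threshold and the sample data are hypothetical, and the host of each transaction is assumed to have been resolved beforehand through the EA tool.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    host: str
    timestamp: float     # seconds, same clock as the transaction log
    cpu_percent: float


@dataclass
class Transaction:
    transaction_id: str
    host: str            # assumed to be resolved through the EA tool
    start: float
    end: float


def correlate(samples, transactions, cpu_threshold=90.0):
    """Map each high-CPU sample to the transactions active on that host at that time."""
    suspects = {}
    for sample in samples:
        if sample.cpu_percent < cpu_threshold:
            continue
        active = [
            t.transaction_id
            for t in transactions
            if t.host == sample.host and t.start <= sample.timestamp <= t.end
        ]
        suspects[(sample.host, sample.timestamp)] = active
    return suspects


# Hypothetical example data.
samples = [
    MetricSample("app-server-01", 100.0, 35.0),
    MetricSample("app-server-01", 105.0, 97.0),
]
transactions = [
    Transaction("tx-a", "app-server-01", 90.0, 101.0),
    Transaction("tx-b", "app-server-01", 103.0, 110.0),
    Transaction("tx-c", "app-server-02", 100.0, 110.0),
]

result = correlate(samples, transactions)
print(result)  # {('app-server-01', 105.0): ['tx-b']}
```

The join only works if the clocks of the monitored hosts are synchronized. With many samples, an interval index would be preferable to the linear scan used here.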

Monitor the Application Layer
The platform consists of microservices which provide a wide range of backend and mobility services. These applications are constructed from collections of software modules that were developed by different teams in different programming languages. A major challenge in monitoring this configuration is to identify how and to what extent the microservices communicate with each other, in order to understand system behavior and reason about performance issues.
In order to achieve this goal we apply the application performance monitoring (APM) approach “Dapper” described by Google and adapted by several academic projects like Kieker and Pivot Tracing, and open source projects like Pinpoint or Zipkin. Dapper provides a solution for analyzing the overall structure of a system and how its components are interconnected by tracing transactions across microservices without changing the application code. Each transaction contains a collection of span identifiers (span ids), each of which refers to a specific Remote Procedure Call (RPC). However, as the span ids are generated from scratch by default, the APM solution has to be modified so that the keys describing the specific component of the platform architecture are issued by the central key manager of the EA tool. Furthermore, it has to be ensured that this modification has no significant performance impact on the services.
In addition, the span identifiers need to be assigned to the log events that are written by the respective service. This can be realized by altering the Logstash configuration file.
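
As a rough illustration of how trace data of this kind exposes the communication structure, the sketch below derives a service call graph from spans and attaches the trace context to a log event. It is a simplified model, not the API of Dapper, Zipkin or Logstash: the `Span` fields, the service names and both helper functions are assumptions, and in the setup described above the log enrichment would be done in the Logstash pipeline rather than in Python.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # identifies the whole transaction
    span_id: str               # identifies one RPC within the transaction
    parent_id: Optional[str]   # span id of the calling RPC, None for the root
    service: str               # component key (issued by the EA tool in the described setup)


def call_graph(spans):
    """Count caller -> callee edges between services by following parent links."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get((span.trace_id, span.parent_id))
        if parent is not None and parent.service != span.service:
            edges[(parent.service, span.service)] += 1
    return edges


def tag_log_event(event, span):
    """Return a copy of a log event enriched with the trace context of a span."""
    return {**event, "trace_id": span.trace_id, "span_id": span.span_id}


# Hypothetical example data: two traced transactions.
spans = [
    Span("t1", "s1", None, "gateway"),
    Span("t1", "s2", "s1", "booking"),
    Span("t1", "s3", "s2", "payment"),
    Span("t2", "s1", None, "gateway"),
    Span("t2", "s2", "s1", "booking"),
]

edges = call_graph(spans)
print(dict(edges))  # {('gateway', 'booking'): 2, ('booking', 'payment'): 1}

tagged = tag_log_event({"message": "payment accepted"}, spans[2])
print(tagged)  # {'message': 'payment accepted', 'trace_id': 't1', 'span_id': 's3'}
```

Once log events carry the trace and span ids, they can be joined with the call graph, so that a log line can be traced back to the RPC and the transaction that produced it.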

Monitor the Business Process Layer
The goal of business process monitoring is to extract from transaction logs the business events that refer to a well-defined step in a business process activity. The difficulty lies in determining who performs the activity, which transaction events make up a whole activity and, in particular, when an activity started and finished. To address this challenge, we extend the EA tool with a business process manager that helps to define and manage business process events based on the trace information provided by the APM solution. Each relevant RPC refers to a business event and is mapped to one or more specific business activities that, in turn, compose a particular business case. However, at least initially, the table that maps RPCs to predefined business process activities has to be created manually and kept up to date.
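
The mapping step can be sketched as follows. This is only an illustration: the RPC names, the activity names and the event fields are invented, each RPC is mapped to a single activity for brevity, and one trace is assumed to correspond to one business case. The sketch groups traced RPCs by business activity and derives when each activity started and finished.

```python
# Hypothetical, manually maintained mapping table: RPC name -> business activity.
RPC_TO_ACTIVITY = {
    "SearchService.findRides": "Search ride",
    "BookingService.reserve": "Book ride",
    "PaymentService.charge": "Book ride",
}


def business_events(rpc_events, mapping=RPC_TO_ACTIVITY):
    """Aggregate traced RPCs into (case, activity) -> (start, end)."""
    activities = {}
    for event in rpc_events:
        activity = mapping.get(event["rpc"])
        if activity is None:
            continue  # RPC without business relevance
        key = (event["trace_id"], activity)
        start, end = activities.get(key, (event["start"], event["end"]))
        activities[key] = (min(start, event["start"]), max(end, event["end"]))
    return activities


# Hypothetical trace of one business case (timestamps in seconds).
rpc_events = [
    {"trace_id": "t1", "rpc": "SearchService.findRides", "start": 0.0, "end": 0.4},
    {"trace_id": "t1", "rpc": "HealthCheck.ping", "start": 0.5, "end": 0.6},
    {"trace_id": "t1", "rpc": "BookingService.reserve", "start": 1.0, "end": 1.3},
    {"trace_id": "t1", "rpc": "PaymentService.charge", "start": 1.3, "end": 2.1},
]

result = business_events(rpc_events)
print(result)  # {('t1', 'Search ride'): (0.0, 0.4), ('t1', 'Book ride'): (1.0, 2.1)}
```

RPCs that are missing from the table are silently ignored in this sketch, which is why the mapping has to be kept up to date as services evolve.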

– Kleehaus, M.; Uludağ, Ö.; Matthes, F.: Towards a Multi-Layer IT Infrastructure Monitoring Approach based on Enterprise Architecture Information, 2nd Workshop on Continuous Software Engineering: SE, Hannover, 2017
– Landthaler, J.; Kleehaus, M.; Matthes, F.: Multi-level Event And Anomaly Correlation Based on Enterprise Architecture Information, 12th International Workshop on Enterprise & Organizational Modeling and Simulation: EOMAS, Ljubljana, Slovenia, 2016

Jochen Graeff: Eine Kombination von Process Mining und Distributed Tracing zur Unterstützung der Fehler-Ursachen-Analyse