As virtualization and cloud technologies mature, the cost and architectural advantages of distributed systems are becoming increasingly prominent. Design concepts such as microservices in particular are gaining popularity in business systems, especially at large Internet companies, and those systems are becoming correspondingly more complex.
As services grow and are split apart, the number of modules in a system keeps increasing, and different modules may be maintained by different teams or programmers. A single customer request may involve several or even dozens of services working together, spanning modules maintained by multiple teams as well as middleware such as caches, databases, and message queues. In such a cloud application architecture, a fault or performance problem at any link of a request seriously affects the user experience. How can the cause of an online fault be located quickly and accurately? How can performance bottlenecks in a request be captured and optimized? How can discrete business request data be correlated for effective user experience analysis? For large-scale, high-traffic websites and for social networking, e-commerce, and gaming applications, these problems are particularly prominent and directly affect end users' perception of the system and its retention rate.
Traditional application operation and maintenance (O&M) relies mainly on logs: the root cause of a fault or a performance bottleneck is located by analyzing alarms, system resources, and logs one by one. However, given the complexity of cloud architectures and the diversity of service request paths, this traditional model can no longer support fault location and performance analysis. This is where an APM system comes in.
APM Definition and Evolution
APM (Application Performance Management) belongs to the IT Operations Management (ITOM) category. It mainly targets monitoring and optimizing the performance and user experience of an enterprise's key business applications, improving the reliability and quality of enterprise IT applications, ensuring that users receive good service, and reducing the total cost of ownership (TCO) of IT. APM has gone through the following three stages as the Internet has developed:
The first stage of APM took place in the early days of the Internet. Because network infrastructure was generally poor, application speed was very sensitive to network speed and to the performance of the underlying resources. At this stage APM was network-centric: network speed was treated as equivalent to application speed, and APM mainly monitored the host's CPU, I/O, memory, and network throughput.
The second stage of APM focused on monitoring basic components. As the Internet developed, network applications became more complex and the number of basic components grew, which pushed APM into a second stage centered on monitoring the health, availability, and performance of IT components.
In recent years, with the rapid development of mobile Internet, cloud computing, big data, and the Internet of Things, a wide variety of business applications have emerged and the complexity of IT applications has grown exponentially. The "user first" nature of Internet products makes user experience a key factor in their survival and development. How to improve user experience, ensure the reliability and stability of services and products, and optimize services all place new demands on application performance management. APM has thus entered its third stage, which takes user experience as its core and addresses the high complexity of business transactions and application architectures.
Based on its analysis of the APM market, Gartner published a new definition of APM.
Under this new definition, the APM market has developed rapidly. APM monitors and manages the performance and availability of application services, helping application and service developers identify and locate performance bottlenecks and faults so that applications reach the expected level of service and end-user experience.
Distributed Tracing Technology
Modern APM systems are largely modeled on Google's Dapper. Dapper traces how each request is processed and thereby measures the cost the application incurs in front-end processing, back-end processing, and server-side calls. Google described Dapper's implementation in the paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", which has served as a valuable reference for distributed tracing in the industry and has become the theoretical basis of today's distributed tracing systems. See the original Dapper paper for details; this article only briefly introduces the principle.
As shown in the figure above, each request invocation in the service chain is divided into four events: clientSend (the client sends a request), serverRecv (the server receives the request), serverSend (the server sends a response), and clientRecv (the client receives the response). These four events are organized into a data structure called a Span. By defining the call (parent-child) relationships between Spans, discrete Span data can be reassembled to restore the complete call chain. The relationship between Spans is identified by traceId, parentId, and spanId: traceId uniquely identifies a complete call chain, parentId identifies the Span that invoked the current Span, and spanId uniquely identifies a single call. The association of Spans within a trace can be represented by the following figure:
Building on Google Dapper's idea of restoring the original call chain through traceId, parentId, and spanId, many large Internet companies have developed their own call tracing systems, such as Twitter's Zipkin, Taobao's Hawkeye, Jingdong's Hydra, and the open source PinPoint. Although the underlying idea is the same, they differ in where they place their instrumentation points.
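The way discrete Spans are reassembled into a call chain can be sketched in a few lines. This is a minimal illustration, not code from Dapper or any of the systems named above: the `Span` class, its field names, and the sample span names are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # identifies the whole call chain
    span_id: str              # identifies this single call
    parent_id: Optional[str]  # span that made this call; None for the root
    name: str


def build_call_tree(spans, trace_id):
    """Return (root, children) for one trace; children maps span_id -> child spans."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.trace_id != trace_id:
            continue  # span belongs to another request
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(root, children, depth=0):
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + root.name]
    for child in children[root.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


# Spans arrive out of order and mixed with spans from another trace.
spans = [
    Span("t1", "3", "2", "db.query"),
    Span("t2", "9", None, "other.request"),
    Span("t1", "1", None, "frontend"),
    Span("t1", "2", "1", "order-service"),
    Span("t1", "4", "1", "cache.get"),
]
root, children = build_call_tree(spans, "t1")
print("\n".join(render(root, children)))
```

Grouping by traceId selects the spans of one request, and the parentId links alone are enough to rebuild the tree, which is why spans can be reported independently by each service.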
Two Trends in Data Collection for Distributed Tracing
An application performance management system mainly consists of four parts: data sources, collection and transmission, analysis and computation, and visualization and query. The core part is the data source, with data collected from both the client and the server. Client-side data collection mainly relies on active dial testing and passive instrumentation, which are not described in detail here. This article mainly introduces server-side data collection.
Data collection on the server side can be divided into two major categories:
· Network bypass monitoring: application performance is analyzed from production application traffic captured at a switch, at a network interface, or on the host where the application service is deployed. This method is less intrusive to applications and services and has less performance impact. However, its collection granularity is coarse and it cannot locate problems at the code level, and when a secure transport protocol is used it cannot analyze requests or transactions.
· Probe instrumentation: application performance data is collected by probes deployed on, or embedded in, the applications on production servers. This approach can collect very complete, fine-grained monitoring data and supports code-level problem location. However, it is intrusive to the application: if the instrumentation code misbehaves, it affects the performance and stability of the application itself.
For applications and services, data is mainly collected through probe instrumentation. Probe instrumentation falls into two main types: code-intrusive instrumentation, represented by Zipkin, and bytecode-enhancement instrumentation, represented by PinPoint.
Zipkin and Code-Intrusive Collection: No Dependence on Framework Ecosystem Maturity
Zipkin is Twitter's open source distributed tracing system. It helps collect the timing data needed to troubleshoot latency problems in microservices, and provides collection, storage, query, and dependency analysis of call trace data. Zipkin is purely a distributed tracing system; it does not offer user experience analysis or application monitoring statistics. Zipkin uses code-intrusive instrumentation. Officially it provides instrumentation based on the Finagle framework, while support for other languages and frameworks relies mainly on community contributions and currently covers Java, Scala, Node, Go, Python, Ruby, C#, and other mainstream languages and frameworks.
Code intrusion means either providing an SDK that application developers call explicitly, or providing a framework with the instrumentation code built in. Some companies with framework R&D capability have, like Google, placed the instrumentation points in their development or communication frameworks, so that every application built on the unified framework gains tracing naturally and nobody other than the framework team needs to care how the instrumentation is implemented or invoked. The advantage is that once the framework is adopted, no extra attention to instrumentation is needed, which effectively lowers its cost. Twitter's Zipkin and Taobao's Hawkeye chose this approach.
The industry also offers a large number of instrumentation libraries that let call chain data be collected through ready-made instrumentation components. These libraries provide instrumentation for standard service frameworks such as Servlet, Spring MVC, and HTTP Client, and for common middleware such as MySQL and Kafka, so that an application built on these standard frameworks can report call chain data after only a little code and configuration. Brave provides a large number of such instrumentations for standard frameworks, and also offers a simple, standardized interface for customization and extension when the packaged instrumentation does not meet business requirements.
Code-intrusive instrumentation is more extensible and makes it easy for users to customize the type and level of data collected. However, whether it is delivered through a framework, an instrumentation library, or an SDK, it requires changes to code, and when the application evolves or the framework is upgraded the application code has to be modified again. It is also difficult for application developers to identify exactly where instrumentation is needed, and code-intrusive instrumentation traces at a relatively shallow level, so detailed runtime state information cannot be obtained.
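To make the SDK style of code intrusion concrete, here is a minimal sketch in which the application itself opens spans around the calls it wants traced. The `Tracer` class and its `span()` method are hypothetical and written for this example; they are not the API of Zipkin, Brave, or any other library, and a real SDK would also report the finished spans to a collector.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-process tracer used only for this illustration."""

    def __init__(self):
        self.finished = []   # completed span records
        self._stack = []     # currently open spans, innermost last

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = {
            "traceId": parent["traceId"] if parent else uuid.uuid4().hex,
            "spanId": uuid.uuid4().hex[:16],
            "parentId": parent["spanId"] if parent else None,
            "name": name,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["durationMs"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(record)


tracer = Tracer()


def load_order(order_id):
    # The developer has to decide that this call is worth tracing.
    with tracer.span("db.load_order"):
        return {"id": order_id}


def handle_request(order_id):
    with tracer.span("handle_request"):
        return load_order(order_id)


handle_request(42)
for record in tracer.finished:
    print(record["name"], record["parentId"])
```

The tracing calls live inside `handle_request` and `load_order`, which illustrates both points made above: the developer chooses what to collect, and every change in what should be traced means editing application code.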
PinPoint and Bytecode-Enhanced Collection: Deep Instrumentation Without Code Intrusion
Unlike Zipkin, PinPoint is an open source application performance management tool that uses bytecode enhancement to collect data at the source. Currently only an official Java agent probe is provided. Bytecode-enhancement instrumentation aims to leave application code untouched: for each programming language, the instrumentation is implanted in the language runtime or base libraries using whatever technique suits that language. For Java applications, bytecode enhancement is applied when the JVM starts, with different instrumentation plug-ins covering different communication protocols, middleware, and development frameworks, and Java call code is intercepted at the function level. The advantage is that stack-level call information and other runtime state can be captured, helping users locate problems quickly without auxiliary tools such as logs.
PinPoint uses bytecode enhancement to collect APM data. A Java agent probe is configured at application startup and actively intercepts the behavior of the application code. Application developers do not need to modify their code; PinPoint decides which APIs are instrumented. Compared with the code-intrusive instrumentation of other APM systems, bytecode enhancement can in theory instrument any location. In practice, however, like Brave's instrumentation libraries and other intrusive approaches, it depends on how the middleware is implemented: interception requires support from the application-level APIs the middleware exposes and from the framework's underlying drivers.
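The idea of instrumenting an application from the outside can be illustrated by analogy. Python has no JVM agent, so the sketch below swaps in a different technique, replacing methods with timing wrappers at startup, to show the same separation between application code and probe. The `instrument` function, the `OrderDao` class, and the list of intercepted methods are all assumptions for the example and do not reflect PinPoint's plug-in API.

```python
import functools
import time

calls = []  # collected (name, duration in seconds, error type or None)


def instrument(owner, method_names):
    """Replace the named callables on owner with timing wrappers."""
    for name in method_names:
        original = getattr(owner, name)

        def make_wrapper(func, label):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                error = None
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    error = type(exc).__name__
                    raise
                finally:
                    calls.append((label, time.perf_counter() - start, error))
            return wrapper

        setattr(owner, name, make_wrapper(original, f"{owner.__name__}.{name}"))


# Application code: contains no tracing calls at all.
class OrderDao:
    def find(self, order_id):
        return {"id": order_id}

    def delete(self, order_id):
        raise KeyError(order_id)


# "Agent" side: a plug-in list decides which methods are intercepted.
instrument(OrderDao, ["find", "delete"])

dao = OrderDao()
dao.find(7)
try:
    dao.delete(7)
except KeyError:
    pass
print([(name, error) for name, _, error in calls])
```

As in the bytecode-enhancement case, the probe has to know the names of the methods it intercepts, which is why coverage depends on plug-ins written against each middleware's API.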
PinPoint took performance into account from the start: it uses Thrift's binary variable-length encoding, UDP as the transport, dictionary references when passing constants, and asynchronous transmission. Even so, some performance problems and constraints remain in use, and because bytecode enhancement places higher demands on developers, PinPoint is at a disadvantage in extensibility and community ecosystem.
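Two of these optimizations, variable-length integer encoding and dictionary references for constants, can be sketched generically. The code below is not PinPoint's or Thrift's implementation; it is an assumed, simplified illustration of why both reduce the bytes sent per span, and the function and class names are invented for the example.

```python
def encode_varint(value):
    """Encode a non-negative int using 7 data bits per byte (LEB128-style)."""
    if value < 0:
        raise ValueError("only non-negative integers are supported in this sketch")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode one varint from the start of data; return (value, bytes consumed)."""
    value = 0
    for index, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return value, index + 1
    raise ValueError("truncated varint")


class StringDictionary:
    """Send each constant string once, then refer to it by a small integer id."""

    def __init__(self):
        self._ids = {}

    def reference(self, text):
        """Return (id, is_new); is_new tells the sender to ship the text itself too."""
        if text in self._ids:
            return self._ids[text], False
        new_id = len(self._ids)
        self._ids[text] = new_id
        return new_id, True


api_names = StringDictionary()
first = api_names.reference("com.example.OrderDao.find")   # hypothetical API name
second = api_names.reference("com.example.OrderDao.find")
print(encode_varint(300), first, second)
```

Small values such as elapsed milliseconds fit in one or two bytes instead of a fixed four or eight, and a repeated method name costs a short id after its first transmission.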
Huawei APM's Technical Practice: Zero-Intrusion, Full-Lifecycle Care
Huawei APM combines the strengths of the two typical systems, PinPoint and Zipkin, to provide a more convenient, more efficient, and more cost-effective solution.
1. Non-intrusive data collection: one-click probe deployment, more efficient and robust data collection
Huawei APM probes build on the PinPoint collection probe and optimize the data collection model and the performance and reliability of the output component. Based on a survey of the frameworks and middleware commonly used in the industry, plug-in support has also been extended. The goal is to provide users with the most useful performance analysis data at the lowest possible resource cost.
· Automatic deployment of probes: when Huawei Cloud services such as the cloud container engine and cloud application orchestration are used, collection probes can be deployed automatically with a simple selection.
· Support for the Zipkin model: PinPoint and Zipkin are both based on Google's Dapper paper and share roughly the same theoretical basis, but their call chain data models differ considerably. Zipkin has the advantage in openness and community activity. To let Zipkin users connect, Huawei APM probes can output call chain data in Zipkin's data model.
· Data classification and optimization: for APM call performance statistics (throughput, average latency, TPN, and so on), the common industry approach is to derive them by a second pass of extraction and aggregation over call chain data. This requires as many call chain samples as possible for the statistics to be accurate, which inevitably consumes more application resources. To solve this, Huawei APM probes separate the collected data into two classes: call chain data and KPI data. KPI data is aggregated per period over every service request, and includes the request initiator, the service provider, the transaction called, and the call status (duration, success or failure, and so on). Because KPI data is output periodically and is much smaller than call chain data, full collection and statistics over all requests can be achieved with a small resource load.
· Precise data collection: for call chain data, the chains that matter most are those where a call timed out (the threshold is configurable) or raised an exception. On top of a basic sampling rate, Huawei APM starts from customers' actual O&M scenarios and provides precise collection with dynamic configuration. Precise collection lets the customer set a timeout threshold per application or transaction, the number of abnormal call samples collected per period, and the number of normal call samples collected per period, reducing resource consumption while ensuring that the samples of abnormal or timed-out requests are sufficient for performance analysis.
· Data transmission optimization: the output component is optimized for the high resource consumption caused by large data volumes. Application resource consumption is reduced through asynchronous file output, asynchronous pipe output, output data caching, data type reduction, and similar measures.
· Collection escape mechanism: at peak times with high concurrency, the application handles many service requests and consumes a lot of resources. To keep services running normally in this situation, Huawei APM lets users configure resource thresholds for escape. When the application's resource consumption reaches the threshold, the Huawei APM probe stops collecting all O&M data, and it automatically resumes collection once resource consumption falls back below the threshold. The escape mechanism supports dynamic configuration.
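The interplay of KPI aggregation, precise collection, and the escape mechanism described in the list above can be sketched as a single probe-side policy. This is an illustration under stated assumptions, not Huawei APM's implementation: the `CollectionPolicy` class, its parameter names and defaults, and the use of a CPU percentage as the monitored resource are all hypothetical.

```python
from collections import defaultdict


class CollectionPolicy:
    """Hypothetical probe-side policy; names and defaults are assumptions."""

    def __init__(self, timeout_ms=500, max_abnormal=10, max_normal=2,
                 escape_cpu_percent=90.0):
        self.timeout_ms = timeout_ms
        self.max_abnormal = max_abnormal      # abnormal chains kept per period
        self.max_normal = max_normal          # normal chains kept per period
        self.escape_cpu_percent = escape_cpu_percent
        self.start_period()

    def start_period(self):
        """Reset per-period state; call this when a KPI period is flushed."""
        self.kpi = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
        self.kept_abnormal = 0
        self.kept_normal = 0

    def record(self, transaction, duration_ms, failed, cpu_percent):
        """Update KPI data and return True if the call chain should be kept."""
        if cpu_percent >= self.escape_cpu_percent:
            return False  # escape: collect nothing while the application is overloaded
        stats = self.kpi[transaction]  # KPI data covers every request
        stats["count"] += 1
        stats["errors"] += int(failed)
        stats["total_ms"] += duration_ms
        if failed or duration_ms > self.timeout_ms:
            if self.kept_abnormal < self.max_abnormal:
                self.kept_abnormal += 1
                return True
            return False
        if self.kept_normal < self.max_normal:
            self.kept_normal += 1
            return True
        return False


policy = CollectionPolicy(timeout_ms=100, max_abnormal=2, max_normal=1)
decisions = [
    policy.record("/order", 20, False, cpu_percent=30),   # normal, kept
    policy.record("/order", 25, False, cpu_percent=30),   # normal, quota used up
    policy.record("/order", 300, False, cpu_percent=30),  # timeout, kept
    policy.record("/order", 10, True, cpu_percent=30),    # failure, kept
    policy.record("/order", 10, True, cpu_percent=95),    # escaped, not even counted
]
print(decisions, dict(policy.kpi["/order"]))
```

The cheap KPI counters are updated for every request, while the expensive call chains are kept only within per-period quotas that favor abnormal calls, and both stop entirely while the resource threshold is exceeded.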
2. Digital Operation: Providing Service Operation Experience Management and Performance Analysis
Track every business transaction in real time, quickly analyze its operating status, and provide diagnostic capabilities:
· Custom transactions: users can define a transaction name for each URL so that it is easier to understand.
· Health rule configuration: health rules can be configured for each transaction.
· Performance tracking: accurately collect abnormal performance data, compare it with historical baseline data, and detect application anomalies to improve O&M efficiency.
3. Application analysis: application relationships and anomalies at a glance, with fault drill-down
· Application Discovery and Dependency: Accurately collect abnormal performance data, compare historical baseline data, and find application-specific exceptions to improve operational efficiency.
· Application KPI aggregation: microservice instances are aggregated into applications, and their KPI data is automatically rolled up to the application level.
4. Application tracing: trace abnormal business call chains and quickly delimit problems
Supports platform, resource, and application monitoring as well as microservice call chain analysis:
· Massive scale: supports monitoring of millions of containers with query responses within seconds.
· Drill-down: clicking a faulty node drills down automatically to the failed microservice instance, and can link to the failed call chain and call stack to show the input and return values of the failed function.