
Modern Quality Engineering: The 8 AI and ML applications in QE

Let's explore AI/ML technology in the Quality Engineering space that is currently available, not hypothetical, and certainly in use within organizations that embrace Modern Quality Engineering.

Humanity’s Fascination with Artificial Intelligence

Our fascination with AI may have resulted from both the fear of the unknown and the possibilities of the uncharted.

There are, of course, more comprehensive materials and references that discuss and define AI at length, but reduced to its most fundamental aspect, a true superintelligent AI is something that learns not one specific field or function, but can learn anything about anything, as a person would if he or she were motivated enough and had infinite time. An AI that perhaps even chooses its own function: it exists with the capacity to choose for itself what to do and how to do it. It might feel like a dangerous notion to explore, but we can never truly prevent innovation, and if it happens it may very well be the singularity, a definitive event that changes the course of humanity. Maybe it is good that this is still very much science fiction as of this writing.

The Role of Artificial Intelligence in Modern Quality Engineering

In this e-book, we explore AI/ML technology in the Quality Engineering space that is currently available, not hypothetical, and certainly in use within organizations that embrace Modern Quality Engineering. While I have encountered initial reluctance from clients that “this topic is too advanced for us” or that “we are not mature enough for this”, it may be a surprise that success in applying this technology and leveraging relevant tools is not restricted to the biggest companies or most mature organizations, but belongs to those that are strategic and thoughtful about their QE and automation efforts. It may even be counter-intuitive, but we have applied the same technology to accelerate and leapfrog our clients’ quality maturity and capability journey even as early as when they are just barely starting dedicated QE testing.

Organizations Playing Catch-Up Will Benefit the Most from ML Technology

In fact, the biggest successes often come from organizations that are early in their efforts to test and automate, not from those that are already highly mature and stable and well on their way to advanced automation. The benefits generated from leveraging the technology lend themselves to supporting other transformation initiatives and improvement activities. In other words, ‘jumping the QE maturity line’ will actually be better for teams, especially when they are just starting out on this journey. Done right, they will catch up to process-mature organizations even within the first year. And the technology that makes this possible, that evens the playing field more than anything else, is Artificial Intelligence.

AI, or specifically Machine Learning, is the pinnacle of technology in Quality Engineering today. From our deep experience in transforming and evolving Quality Engineering programs in client environments, we have seen that the typical QE maturity journey for many organizations is represented in 7 distinct steps as illustrated in FIGURE 1. We focus on the final step in this third part of our Modern Quality Engineering series; the preceding milestones are discussed at length in previous e-books: The Modern Quality Engineering eBook and The Road to HyperAutomation.

It is understandable to assume that ML-Based Automation is characterized by a single capability, but it’s not as simple as that. There are many applications of Machine Learning in Quality Engineering and we will explore the most common of them here:

  1. Self-Healing Automation
  2. Exploratory Testing
  3. Coverage Optimization
  4. Visual Compare
  5. Test Generation
  6. Anomaly Detection
  7. Root Cause Analysis
  8. Performance Tuning

ML Application 1: Self-Healing Automation

Self-Healing Automation might be the most common of all applications of Machine Learning in testing.

There are a good number of tools in the market today that focus on enabling self-healing automation, and that is because it tackles one of the most time-consuming activities in test automation. Test script maintenance is quite possibly the most challenging activity related to test automation. In fact, most organizations that have self-started their test automation efforts with no experience or outside help end up consuming all their automation capacity just maintaining existing test scripts, never getting enough bandwidth to work on new ones after reaching a certain point. That is because most test cases are brittle and flaky, and without proper test design and framework considerations, every small functionality update and code change will yield false positives: test failures not associated with actual application bugs. In fact, what we’ve seen is that in a typical organization, false positives comprise more than 70% of test failures. That means more effort is actually spent investigating and fixing tests that break than on true application defects. This is because test automation scripts use specific locators to identify objects (or Document Object Model elements) on the UI – if you have done any amount of automation, you would be familiar with these: your XPath, ID, CSS, Tag, Name, etc. What makes test cases brittle is when whatever you use to identify your DOM element changes, and the test cannot execute because it cannot find the target DOM element.

Self-Healing Automation Prevents and Repairs Automation Flakiness

Machine Learning comes into the picture to reduce these false positives, first by working with software that has a recording capability for automatically identifying a set of locators at once, instead of having the test automation scripter specify one. Some of these tools even use over a dozen locators for each object. The combination of locators is then used by the AI to ascertain, should one or more of them change, whether that is sufficient reason to fail the test, or whether the change in locator data is more likely an intentional code change. If the AI decides it is the latter, it will flag it, raise a warning, and proceed with running the rest of the test as far as it can, instead of breaking and calling a failure. This helps significantly, especially when the volume of tests reaches several hundred to a few thousand, since it alleviates some of the effort spent treating all failures as potential defects and investigating them one by one. Other tools go a step further and ‘self-heal’ or correct the script itself, on the premise that the change in code is expected and the locators should now be updated to reflect it. A log is created to allow reverting to the previous version, and this saves even more effort by removing the need to update and re-run the script. A huge benefit again for organizations with thousands of tests and dozens of false positives every run.
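
To make the idea concrete, here is a minimal sketch of multi-locator matching with a healing flag – not any specific vendor’s implementation. The `find_with_healing` helper and the dictionary standing in for a real DOM query engine are illustrative assumptions:

```python
def find_with_healing(dom, locators):
    """Try every recorded locator; heal the ones that broke.

    `dom` maps locator strings to element ids -- a toy stand-in for a
    real DOM query engine such as a browser driver's find_element.
    Returns the element matched by the most locators, plus the stale
    locators that should be re-recorded ('healed').
    """
    matches = {}   # element -> locators that still resolve to it
    stale = []     # locators that no longer match anything
    for loc in locators:
        element = dom.get(loc)
        if element is None:
            stale.append(loc)
        else:
            matches.setdefault(element, []).append(loc)
    if not matches:
        # No locator resolved at all: likely a genuine failure, not drift.
        raise LookupError("no locator matched; treat as a real test failure")
    # Majority vote: the element agreed on by the most locators wins.
    element = max(matches, key=lambda e: len(matches[e]))
    return element, stale
```

In this toy example, two of three recorded locators still agree on the same element, so the test proceeds and only the broken CSS locator is queued for healing.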

ML Application 2: Exploratory Testing

Many QE professionals view traditional exploratory testing as an easy alternative or a last resort, mainly because it does not necessitate creating a plan or documenting steps and results with any significant detail.

It certainly serves a purpose, especially when teams are pressed for time, and when certain guardrails are established to help facilitate a more structured approach – as in the case of session-based testing. Regardless, it is certainly not the most trusted and reliable way to ensure test coverage, even when application experts are involved. Another difficulty lies in the ability to coordinate across multiple testers – when there’s no structure, who’s testing what? There’s sure to be redundant effort and worse, coverage gaps.

As you’d expect, that’s exactly where machine learning comes in – while the machine might traverse the application like a smarter web crawler, the disadvantages of traditional exploratory testing are completely upended because each run is tracked, and ML is used to facilitate additional runs in a way that optimizes test coverage per run. Some tools also allow adding weights to certain steps to manually ‘train’ the algorithm. AI has now made ‘random’ exploratory tests a lot smarter, more methodical, and vastly more efficient.
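
As a simplified illustration of the idea – not any particular tool’s algorithm – the sketch below walks a UI transition graph while preferring the least-visited transition, so repeated sessions spread coverage instead of re-treading the same hot paths. The `explore` helper and the graph representation are assumptions for the example:

```python
import random

def explore(transitions, start, steps, rng=None):
    """Session-based exploration that favors the least-visited transition.

    `transitions` maps each UI state to the states reachable from it --
    a toy stand-in for what an ML crawler learns about the application.
    Preferring under-visited edges spreads coverage across sessions
    instead of re-treading the same hot paths.
    """
    rng = rng or random.Random(0)
    counts = {}              # (state, next_state) -> times traversed
    path, state = [start], start
    for _ in range(steps):
        options = transitions.get(state)
        if not options:      # dead end: nothing more to explore here
            break
        least = min(counts.get((state, s), 0) for s in options)
        candidates = [s for s in options if counts.get((state, s), 0) == least]
        nxt = rng.choice(candidates)
        counts[(state, nxt)] = counts.get((state, nxt), 0) + 1
        path.append(nxt)
        state = nxt
    return path, counts
```

Because the visit counts persist across steps, a short session already exercises both branches out of the home state rather than repeating whichever one was chosen first.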

AI Solves Coverage Gaps Inherent in Traditional Exploratory Testing

There are different techniques to use when working on coverage optimization, as illustrated in the next section – some assume that you start from nothing and build coverage up intelligently, while others start from having too many tests and reduce them to what’s feasible given fixed constraints. This technique, as the name suggests, assumes that there is some coverage built but that it has gaps in it.

There are multiple ways to define Coverage (Code Coverage, Functional Coverage, Requirement Coverage, Process Coverage, etc.), but common among all of them is the challenge of understanding or mapping what 100% coverage is in the first place. Many organizations have struggled in the past to define how much their tests truly cover because it is virtually impossible to map out every possible way to test an application. However, using ML in exploratory testing allows for an automated and smarter way for crawlers to paint a picture of every element, page, and state in the application, which can be used as a basis for the 100% number, even just from a UI navigation perspective. This doesn’t account for other factors that could be considered (such as compatibility coverage, logical coverage, or data coverage), but it could be a start to identifying and quantifying gaps in coverage, which until now have been elusive for many teams.
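
Once a crawler has mapped the application’s states, quantifying coverage becomes simple set arithmetic. The sketch below is a minimal illustration of that idea; the `ui_coverage` helper and its inputs are hypothetical:

```python
def ui_coverage(crawled_states, tested_states):
    """Coverage ratio over crawler-discovered UI states, plus the gaps.

    The crawl map serves as the '100%' denominator that is otherwise so
    hard to define; anything the crawler found but no test touched is a gap.
    """
    crawled = set(crawled_states)
    covered = crawled & set(tested_states)
    gaps = sorted(crawled - covered)
    return len(covered) / len(crawled), gaps
```

The returned gap list is exactly the artifact teams have lacked: a concrete, enumerable answer to “what have we not tested?”, at least from a UI navigation perspective.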

ML Application 3: Coverage Optimization

Quite possibly the most complex application to explain, coverage analysis and optimization is another area where Machine Learning accelerates cognitive, analytical, and decisioning tasks in software testing.

The ability to carefully select tests and optimize coverage is crucial, especially as the pressure to speed up releases and reduce time to market limits the available time window to test in most organizations. In an ideal world, the proper optimization of coverage against capacity and schedule (also known as test selection) considers multiple factors. This topic could fill another eBook by itself, but in summary the main dimensions to consider would be a combination of proper test design techniques, domain and application expertise, real traffic and usage analysis, test environments available, proper capacity estimation, and historical probability of failure.

Understanding Test Design Concepts and How Machine Learning Supports Them

  • Test Design Techniques – represent different ways to generate intelligent test coverage. There are many techniques to consider, such as Equivalence Partitioning or Decision Tables, that allow one to minimize the number of different input values required for testing. A very simple example would be testing the login form – there is a nearly unlimited combination of inputs you could use for the username field, but you don’t really need them all – you can test with valid input (alphanumeric, or email format) and invalid input (spaces, special characters, scripts). This can also be expanded with additional test reduction principles such as orthogonal array testing and pairwise testing, which intelligently focus the testing on unique combinations of data instead of all possible combinations available. Regarding ML, there are tools that can do automated pairwise testing and support other test design techniques, but no tool yet that can thoroughly analyze a situation and apply the test design techniques best suited to it.
  • Domain and Application Expertise – simply put, this is the culmination of experience in a certain industry and field, as well as experience working in the specific application. All other things equal, a tester with tenure will have a certain intuition and understanding of what to prioritize that leads to better test selection. At this time, Domain and Application expertise is fully in the human realm.
  • Real Traffic and Usage Analysis – closely related to understanding the application is analyzing the extent to which real users and customers actually use each field, module, or functionality. It may be surprising to learn that some software companies don’t have this information available, much less use it as a consideration for testing priority – but it is necessary for a holistic approach to prioritization and selection. Even non-AI monitoring tools are used to track and glean insights from application usage patterns, so you can be sure that AI tools are well-positioned to take it to the next level and use those same insights to provide recommendations on test coverage.
  • Test Environments – pertain to the target environment where the test will be run. This is a major consideration for teams that use multiple test environments – development, integration, testing, staging, and production – and possibly several of each. It then becomes important to carefully think through where a certain type of testing (SIT, UAT, Performance) would be executed. At this point, test environments, while easier to manage with containerization and Infrastructure-as-Code, still rely on human decisions about what testing will be performed where. We have yet to see an AI that revises coverage decisions or turns testing jobs in specific environments on or off autonomously by weighing risk against time available. Most likely, this is because it is an advanced strategic decision that would give pause even to veteran QE managers, but also because there is massive risk in leaving this decision to a machine relative to the time it actually saves, which could be a few days at most but more likely hours. This is one good example where, even if something is possible, it may not actually be valuable.
  • Capacity Estimation – there are many estimation techniques available, and during the age of the waterfall lifecycle this used to be a very formal exercise. These techniques can be categorized as either ‘basic’ or ‘advanced’, and ‘individually done’ or ‘collaborative’. With agile development, the trend has been towards ‘basic – collaborative’ techniques, such as planning poker and t-shirt sizing. There are tools that provide good estimates of how long test runs will take, and that ability supports capacity estimation – if you are given a time window to test and have a ready estimate for different levels of testing, you can make a quick decision about what can feasibly run within that timeframe.
  • Probability of Failure – this is a statistical exercise of looking at historical test run defects and incidents, and systematically categorizing them into areas of high vulnerability (also called high defect density). The resultant areas then become more important to test given their propensity to fail, for whatever reason. ML capability to derive test coverage optimization recommendations (and even to automatically incorporate them) based on pre-production defects and historical test runs already exists in a handful of automation tools today. However, the same capability does not currently exist for production defects and live issues, with one big hurdle being that you don’t want to test extensively in production, so you won’t have much in terms of historical test runs either. The best source of information is still production tickets, but many organizations have struggled to analyze and classify all issues to the point that they are actionable for quality engineers. The ML version of this may be even further out of reach.
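
The pairwise technique mentioned among the test design techniques above can be sketched with a small greedy generator. This is a deliberately simple illustration; dedicated tools use more sophisticated covering-array algorithms, and the `pairwise_suite` helper is an assumption for the example:

```python
from itertools import combinations, product

def pairwise_suite(parameters):
    """Greedy all-pairs test selection.

    Covers every pair of values across every pair of parameters with far
    fewer rows than the full cartesian product, by repeatedly picking the
    candidate row that covers the most still-uncovered pairs.
    """
    names = list(parameters)
    # Every value pair across every pair of parameters must be covered.
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va, vb in product(parameters[a], parameters[b])}
    candidates = list(product(*(parameters[n] for n in names)))
    suite = []
    while uncovered:
        def gain(row):
            labelled = list(zip(names, row))
            return len(uncovered & set(combinations(labelled, 2)))
        best = max(candidates, key=gain)
        labelled = list(zip(names, best))
        uncovered -= set(combinations(labelled, 2))
        suite.append(dict(labelled))
    return suite
```

For three parameters with two values each, the full cartesian product is 8 tests; the greedy suite covers every value pair with noticeably fewer.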

ML Application 4: Visual Compare

Computer Vision, or Object Vision, must be one of the easiest applications of Machine Learning to recognize.

If you see a Tesla using its autopilot (or any autonomous vehicle for that matter), you’d understand quickly that AI must be in place for the car to decide how to proceed based on external stimuli – maybe it’s another car, a sign, or a traffic light – that the car must have seen, and then understood what those stimuli meant before it arrived at a decision to keep going, to stop, or to speed up. In the same way, AI is used by some tools to do a visual compare of different versions of the application. Instead of writing scripts with assertions against DOM elements in the UI, the visual compare algorithms are used to evaluate and tag potential defects as the tool looks at previous iterations of the UI’s appearance and compares it to the current version (the version under test) to identify changes and to flag potential issues.

ML Supports Visual Comparison at Multiple Levels

Of course, there is much more nuance to this than just AI tagging and classifying visual discrepancies in images. There are many ways in which ML algorithms become more accurate in determining whether a visual discrepancy is substantial enough to be raised as a defect, and most tools offer the ability to further train and control this activity. You can also change the granularity and sensitivity of the comparison, such as looking only at page structure versus looking at actual pixel movements. In addition, the areas where the comparison is done can be adjusted to avoid items that will always change – e.g., a marketing banner, news and content feeds, or a piece of copy that changes languages according to the user’s location. This is particularly helpful for the localization needs of companies operating in multiple countries, and important where compatibility across a wide variety of devices, operating systems, and browsers is necessary – as in mobile native and mobile web testing. This type of capability is especially crucial for consumer industries such as retail and hospitality, where a consistent overall experience is top priority.
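
A toy version of the comparison logic, with ignore regions for ever-changing areas like a marketing banner, might look like the following. The `visual_diff` helper, the list-of-lists image representation, and the threshold are illustrative assumptions – real tools work on rendered screenshots with far more sophisticated models:

```python
def visual_diff(baseline, candidate, ignore_regions=(), threshold=0.01):
    """Flag a visual regression when too many compared pixels changed.

    Images are 2D lists of pixel values; `ignore_regions` are
    (row0, col0, row1, col1) boxes excluded from the comparison, such as
    a rotating marketing banner or a localized piece of copy.
    """
    def ignored(r, c):
        return any(r0 <= r < r1 and c0 <= c < c1
                   for r0, c0, r1, c1 in ignore_regions)
    compared = changed = 0
    for r, (brow, crow) in enumerate(zip(baseline, candidate)):
        for c, (b, p) in enumerate(zip(brow, crow)):
            if ignored(r, c):
                continue
            compared += 1
            changed += int(b != p)
    ratio = changed / compared if compared else 0.0
    return ratio > threshold, ratio
```

Tuning the threshold is the sketch equivalent of adjusting sensitivity in a real tool: higher values tolerate pixel noise, lower values treat any movement as a potential defect.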

ML Application 5: Test Generation

Test Generation has a very close association with exploratory testing – in fact, some find it hard to tell the difference.

One can argue that since automated exploratory testing autonomously ‘creates tests’ as it traverses the target application, it should fit into the test generation group. We consider these two items separate, as we look at exploratory testing as the generation of a test run instance, and not necessarily the generation of the test scripts themselves. Test scripts need to be independent artifacts that allow for specific reuse, instead of a randomly generated set of actions that are not reusable, or possibly not even reproducible. Therefore, when we define test generation, we mean the actual creation of artifacts for targeted testing.

Test Generation via Machine Learning Takes Different Forms

With this definition, ML comes into play with tools that generate tests in a few different ways. Some tools use ML to create sophisticated maps of the application (not dissimilar to the ‘Exploratory Testing’ application above) and transform them into test scripts. Others support test generation without making it fully autonomous, evolving ‘record-and-play’ scripts from simple object detection and capture into context-sensitive scripts with dozens of locators identified automatically by the AI (as in the case of ‘Self-Healing Automation’). Additionally, there are tools that use Natural Language Processing to generate tests from simple English instruction (e.g., click ‘Home’ button), from BDD templates, or from manual test scripts in csv or spreadsheet format. The latter two might be useful when a team is converting a substantial amount of manual tests to automation, but may not be as important after the initial effort. The first two are more highly regarded since they are continually useful in AI-driven test automation operations.
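
To give a flavor of the NLP-driven approach, here is a deliberately tiny sketch that maps a plain-English step to a structured action. Real tools use far richer language models; the `parse_step` helper and its two patterns are purely illustrative:

```python
import re

def parse_step(instruction):
    """Map a plain-English test step to a structured action (tiny sketch).

    Shows only the shape of the transformation from prose to executable
    intent -- production NLP test generators handle far more variation.
    """
    patterns = [
        (r"click (?:the )?['\"]?(?P<target>[\w ]+?)['\"]? button", "click"),
        (r"type ['\"]?(?P<value>[^'\"]+)['\"]? into (?:the )?"
         r"['\"]?(?P<target>[\w ]+?)['\"]? field", "type"),
    ]
    for pattern, action in patterns:
        m = re.fullmatch(pattern, instruction.strip(), re.IGNORECASE)
        if m:
            return {"action": action, **m.groupdict()}
    return {"action": "unknown", "raw": instruction}
```

The structured output is what an automation engine would then bind to actual locators and execute – which is where the locator-rich scripts from the self-healing discussion come back into play.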

(Side note: for those who are newer to test automation, ‘record-and-play’ was a term originating from Mercury Interactive’s WinRunner, back when a vast majority of QA professionals were using these tools in the late 90s to early 00s – WinRunner is the ancestor of today’s MicroFocus Unified Functional Testing.)

ML Application 6: Anomaly Detection

It is only in recent years that QE professionals have truly accepted that quality in production is part of Modern Quality Engineering – in fact, many teams and organizations still don’t.

However, this will be a clear necessity as more and more companies move towards DevOpsification. As the lines of development and operations become blurry and even non-existent, the way we look at Quality Engineering should change accordingly.

One such change that needs to be considered and embraced is understanding product and application quality from the perspective of the user in live environments. Quality professionals have often said that the one true measure of product quality happens in production, yet we have seen many quality engineering efforts that do not interact with, much less learn from, what happens in production. In Modern Quality Engineering, there is continuous analysis of real usage that drives the improvement of quality controls, systems, and tests for current and future releases. If you are not learning and adjusting every release, there is a considerable gap in your quality strategy – and even if you are, are you learning and applying fast enough? The ideal state is that analysis, learning, and adjustment happen constantly and for current work – that every major issue in production shifts the way development and engineering is done, even in small ways. This is a considerable gap and blind spot in many organizations today. Fortunately, ML technology allows teams to bridge this need more quickly than ever before. One big capability area towards this is Anomaly Detection.

Anomaly Detection is part of Modern Quality Engineering

Anomaly Detection, in simpler terms, is the ability to flag potential user experience issues by processing and correlating application usage data and evaluating outliers. It is usually driven by organizational KPIs and derived from application and infrastructure monitoring. In the case of retail, an anomaly might be a low conversion rate of traffic to purchases from a specific browser relative to other browsers. Looking at it through a Quality Engineering lens, this might signal compatibility issues of the application for that browser. This may be an obvious example, but the real world offers much more complexity and pulls in many more factors to consider – beginning with absolute data points such as time of day, device used, and location, and even potentially touching on more abstract ones such as cultural or seasonal context. Maybe this browser is mostly used from a certain geographical area, and it was the geography that skewed the data and is the underlying cause. Thankfully, that is where Machine Learning takes over and processes data vastly more efficiently and much faster than human effort allows – ensuring a low time to detection, and better chances of resolving issues before users are noticeably impacted and business is disrupted. With a proper ML-driven anomaly detection system, these outliers are seen from different contexts, then weighed and quantified in terms of criticality to elicit the proportionate response.
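
As a simplified illustration of the statistical core, the sketch below flags segments (say, browsers) whose conversion rate deviates strongly from their peers, using a robust modified z-score based on the median absolute deviation. The helper name and the 3.5 cutoff are assumptions for the example; production systems correlate far more signals before raising an alert:

```python
from statistics import median

def flag_anomalies(metric_by_segment, threshold=3.5):
    """Flag segments whose metric is an outlier among its peers.

    Uses the modified z-score (median absolute deviation), which stays
    robust even when the outlier itself distorts the mean; 3.5 is a
    commonly used cutoff for this statistic.
    """
    values = list(metric_by_segment.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []   # no spread at all: nothing stands out
    return [segment for segment, v in metric_by_segment.items()
            if abs(0.6745 * (v - med) / mad) > threshold]
```

Fed the retail example above, the browser with the collapsed conversion rate is flagged while its healthy peers are not – the starting point for the contextual weighing that a full anomaly detection system performs.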

This is an organic next step for teams that already have great monitoring, alerting, and response in place – as with the other applications, the insights the ML tool provides become more relevant as more input data is provided. At the same time, the insights derived are futile if the alerts are irrelevant, redundant, or simply too many, or if the process of triage, recovery, and resolution is not good enough to keep up with the inflow of anomalies. But in the right hands, ML-based anomaly detection could revolutionize an organization’s business strategy.

ML Application 7: Root Cause Analysis

In many ways, automated Root Cause Analysis has a strong relationship, even an overlap, with Anomaly Detection.

Its effectiveness depends on monitoring systems being in place, having sufficient data, and a stable organizational process for responding to and resolving findings. Nor is it something that typically surfaces as a major priority or a crucial issue to fix urgently. As with Anomaly Detection, Root Cause Analysis is an application of Supervised Machine Learning algorithms and is incredibly helpful in speeding up analysis activities that in human hands can only go so fast or so deep. However, we do see the need to separate the two, for a couple of reasons.

How Root Cause Analysis Contrasts with Anomaly Detection

First, Anomaly Detection is more concerned with symptoms that appear in how users interact with the application, while Root Cause Analysis finds answers in the data collected from monitored systems to fix an already defined issue. In that way, Anomaly Detection may go much broader in its hypotheses and scope – and you want it to, because it may lead to a better understanding of your business in the eyes of your users. Root Cause Analysis starts with a known issue and is meant to drill down from a large set of potential problem areas until there is a single identified problem.

Secondly, Anomaly Detection is leveraged in the context of establishing relationships between experience and application and infrastructure metrics. In contrast, Root Cause Analysis usually pertains to finding the source of an issue in production – though it is now also being used to find root causes of defects during the development phase. Most of the tools currently available focus on the former, as they should, since there is more intrinsic value in diagnosing production issues than development defects. And as representatives for quality, our mindset should not be to create lines of demarcation, deciding that items are out of scope for us and that someone else is accountable for them – we do what we can to own the problem and rally the organization to solve it as a quality problem.

ML Application 8: Performance Tuning

The final application in Quality Engineering may not even be considered by many as something that belongs in the QE world.

However, where quality is an interconnected discipline aimed at mitigating risk and driving how well a product functions, a non-functional attribute like performance (more specifically, the responsiveness and resilience of systems) seems ever more critical. Quality is quality, whether in production or in development, and when your customers have issues for any reason, that is a quality gap, whatever the cause. And we have seen the realities of this in our experience, especially with Fortune 500 companies – a small increase (0.5 – 1 second) in latency for transaction feedback or page load time results in tremendous loss of opportunity and impact (some report anywhere from a 2-15% decrease in conversion rate or 10-20% traffic loss). That is the enormous significance of ensuring performant application architecture and infrastructure.

Many organizations have since heavily invested in auto-scaling cloud environments because of that need. The ability to containerize and deploy Infrastructure as Code also democratized environment provisioning across product teams in the organization. At any time, any developer can spin up and turn down an environment on demand, in a matter of seconds. But here’s the plot twist: what companies realize often too late is that all this wonderful autonomy creates a problem of spiraling cloud subscription and services cost.

The Modern Enterprise Depends on Elastic Infrastructure and Requires Performance Tuning

That is where this final ML capability comes in – for the humanly impossible task of balancing cost against performance for cloud infrastructure, there is now AI that makes that decision for you and acts on it completely independently. The concept is straightforward: monitoring the application services via an APM tool and comparing against historical traffic allows the tool to measure the current performance of the application on a specific infrastructure configuration. It then tries other cloud configurations that may cost less or more, all the while tracking performance goals. Once fully initiated, it will settle on the cheapest possible configuration (such as lower memory or fewer CPU cores) that meets the thresholds, and restarts this balancing process for any changes in the services or the application. The cloud cost-to-performance optimization then becomes a continuously iterating and adjusting process without human intervention. Some platforms even account for seasonality of data, and project and forecast throughput demand to prevent any degradation in performance in the first place. While one instance might mean just thousands of dollars over the span of a year, at enterprise scale we have seen this technology return millions in cloud costs – all with virtually no active human intervention. This is really only feasible with ML – any effort to do this manually just creates a bottleneck in the deployment process, since services are constantly changing.
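
The core of that balancing loop can be caricatured as ‘pick the cheapest configuration that still meets the performance target.’ The sketch below is a minimal single pass of that idea; the `cheapest_config` helper and `measure_latency` (standing in for an APM-backed benchmark) are assumptions, and real platforms rerun this continuously as services and traffic change:

```python
def cheapest_config(configs, measure_latency, slo_ms):
    """Pick the cheapest infrastructure configuration that meets the SLO.

    `measure_latency` stands in for an APM-backed benchmark of the
    application on a given configuration. One pass of the continuous
    cost-to-performance balancing loop described above.
    """
    meeting = [c for c in configs if measure_latency(c) <= slo_ms]
    if not meeting:
        raise RuntimeError("no candidate configuration meets the performance target")
    return min(meeting, key=lambda c: c["cost"])
```

In practice, the candidate configurations, the latency measurements, and the thresholds all shift over time, which is exactly why the loop must run autonomously rather than as a one-off manual exercise.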


When evaluating ML-driven testing tools and platforms, many will claim to have AI and Machine Learning algorithms supporting their capability, and that may be technically true. After all, even Optical Character Recognition counts as AI, yet you can easily download a free and lightweight desktop tool that does that. At the opposite end, there are many tools that use more than one of the applications described above, or do a good job of creating synergistic algorithms that overlap across the definitions provided. The more important question is not whether they use ML, but whether it is applied in such a way that it creates demonstrable value.

It is important to reflect on why there are multiple takes on how to use this technology in the first place – each organization has different goals, priorities, challenges, and circumstances that determine the most effective investment it could make in this technology. One example, just to illustrate: a self-healing automation tool might be more appropriate for teams that have manual testers in place but not a lot of automation expertise, whereas an organization with a well-established site reliability engineering team will probably benefit more from automated Root Cause Analysis. Consider which AI application will jumpstart your automation efforts or considerably accelerate your capability towards meeting business goals and objectives. In other words, if it doesn’t generate value for your organization, then the endeavor has failed.

RCG’s Recommendation: Strategy Before Capability

We recommend taking a strategic approach to decide which ML application (and consequently, which tool) would serve the goals of the organization.

This is true of any tool and technology under the sun, and it applies to everything from multi-million-dollar initiatives to daily personal living. Before you start a journey, a transformation, or a lifestyle change, always start with defined goals – start with purpose. When applying AI, know what current problem you’re trying to solve, or what future capability you’re trying to enable, and know how to validate it – either against very specific metrics (e.g., 80% of manual functional tests need to be automated) or qualitative but meaningful milestones (e.g., the team should be able to switch between manual and automated test execution within a sprint).

To support building and executing on your strategy, RCG offers its significant experience, along with its partnerships with leaders in each of the areas discussed. Our program is focused on easing your journey to adoption and guiding it to create definitive value in the organization. We partner with you to avoid and mitigate pitfalls so that you can confidently leverage all the advantages that AI and Machine Learning provides towards Modern Quality Engineering.
