Words Matter: Testing Copy With Shakespeare
By Tim Brandall, Shawn Xu, Pu Chen and Jen Schaefer
It sounds like a dream come true for word-focused product people: a tool that lets Content Designers, Language Managers and others run lighter-weight copy tests without heavy engineering support. At Netflix, this has been the reality for around a year, thanks to a tool known as Shakespeare. Designed and built by the Internationalization team, Shakespeare has unlocked copy testing in ways that weren’t possible before, from validating word choices, to optimizing language for emerging markets, to making timely tonal updates.
The Backstory
Since the DVD days, Netflix’s success has been shaped by product innovation validated by a/b testing — experiments where two or more variants of an experience are shown to users at random and statistical analysis determines which variation performs better for a given goal. Our UI is constantly evolving, and the experience you get today may not be the same one you find tomorrow.
Pop quiz time: How do we decide what to test? Do we:
- Copy others?
- Go with whatever we hear about most on Twitter?
- Hire an expert and let them decide?
- Let leadership make the call?
- Vote on what we think will work best?
The answer is actually “none of the above.” Test hypotheses come from a combination of big data analytics, qualitative research and informed judgment, with lots of vigorous debate along the way.
Let’s look at a real-world example. Recently, we wanted to see whether changing the tone of the call to action (CTA) on one of our buttons would impact sign-ups. To test out this idea, we came up with several different CTAs to show to different audience segments, then measured the results. Here are four examples:
We ended up productizing “Get Started” — although due to the nature of a/b testing, we’ll keep on experimenting to see if we can find a higher-performing result!
Prior to Shakespeare, we lacked an efficient way to run a copy test like this one because it involved a lot of engineering work. Copy tests were few and far between, and we were missing out on a lot of language-focused wins — not just in English, but globally.
Introducing Shakespeare
To figure out a solution, the Internationalization team started by closely examining how copy testing was handled. Major pain points included:
- Every test required non-trivial engineering support.
- Tests weren’t easy to set up, configure or clean up afterwards.
- Tests took days to weeks to deploy.
- We didn’t have a uniform way to test across platforms.
- There was no easy way to test localized or transcreated copy.
- We lacked the infrastructure for continuous explore/exploit tests, in which real-time analysis detects winning variants early and allocates more users to those cells, versus traditional, equally distributed a/b tests (see the sketch after this list).
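For context, explore/exploit allocation is usually implemented with a bandit-style algorithm. Here is a minimal sketch of the general idea using Thompson sampling; the variants and numbers are made up, and this is purely illustrative rather than the allocation logic our a/b platform actually uses.

```python
import random

# Illustrative Thompson-sampling allocator for copy variants; a sketch of the
# explore/exploit idea, not Netflix's implementation. Variants that convert
# better get shown more often as evidence accumulates.
variants = ["Get Started", "Try Now", "Join Free"]  # hypothetical CTA variants
successes = {v: 1 for v in variants}  # Beta prior (alpha)
failures = {v: 1 for v in variants}   # Beta prior (beta)

def choose_variant() -> str:
    # Sample a plausible conversion rate per variant and show the current best.
    sampled = {v: random.betavariate(successes[v], failures[v]) for v in variants}
    return max(sampled, key=sampled.get)

def record_outcome(variant: str, converted: bool) -> None:
    # Update the variant's evidence after observing whether the user converted.
    if converted:
        successes[variant] += 1
    else:
        failures[variant] += 1
```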
The way we handled copy testing was hacky even in English. Once you added in the dozens of languages Netflix is translated into, it became clear that our process was not only inefficient but also unable to keep up with our growing need for copy testing across languages and cultures.
What we needed was a quicker, easier way to test copy: a well-designed system that scales. The resulting tool, Shakespeare, lets Content Designers, Language Managers and others deploy copy tests in literally minutes once it’s integrated (though we still partner with our Data Scientists to make sure we’re looking at the right metrics and aren’t conflicting with tests that are already running).
Major benefits of Shakespeare include:
- Engineering dependencies are greatly reduced.
- Tests are easier to set up, configure and clean up.
- We have the ability to make real-time copy updates.
- We’re able to consistently test across platforms.
- Localized or transcreated copy can be tested independent from the English source.
- We have the option for continuous explore/exploit.
Additionally, the Internationalization team partnered with the XD Content Design team to incorporate a tone-tagging system that we hope will give us valuable insights into voice and tone to help make sure everyone on Netflix feels welcome. (Read on for more about the Content Design perspective.)
Building Shakespeare
To create Shakespeare, we first needed to understand how strings are stored and fetched.
String storage
From an engineer’s code, strings are sent to a message repo that’s linked to our translation tool.
String fetching
From the message repo, a centralized platform service layer acts as an agent between the repo and the client apps. Since the platform service layer is a simple pass-through, the client apps need to retain the test-copy to test-cell mapping information.
Prior to Shakespeare, engineers needed to enter the test copy into the resource bundle and write custom logic in the source code to handle test-copy and test-cell mappings. This step was manual and could take days to weeks to complete, depending on available engineering resources and the build/release schedule. Also, engineers needed to manually remove the non-winning copy, along with the business logic, after the test was complete.
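To make that old workflow concrete, here is a hypothetical sketch of the kind of hand-written, per-test branching it involved; the message keys and function name are invented for illustration.

```python
# Hypothetical pre-Shakespeare pattern: copy variants lived in the resource
# bundle, the cell-to-copy branching was hand-written in the app, and both had
# to be manually removed after the test. Keys and names are illustrative.
def get_signup_cta(test_cell: int, messages: dict) -> str:
    if test_cell == 2:
        return messages["signup.cta.variant_b"]  # e.g. "Try Now"
    if test_cell == 3:
        return messages["signup.cta.variant_c"]  # e.g. "Join Free"
    return messages["signup.cta"]                # production copy (control cell)
```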
Improving the Process
We designed Shakespeare to abstract the manual steps and business logic. The person running the Shakespeare test just needs to tell the system where to find the production copy and enter the test variants through the Shakespeare Web UI. From there, Shakespeare automatically takes care of the mapping between test copy and test cell. (Note that Shakespeare can only be used to test new versions of existing UI strings, when there’s no other design variant.)
Setting up a Shakespeare test
To run a copy test, the Content Designer, Language Manager or PM simply enters a pointer to the production copy and then enters the various copy variants.
The Shakespeare UI saves the extra copy variants in the message repo, facilitating the deployment of the test.
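In practical terms, a test setup boils down to a pointer to the production string plus the variants to try. Here is a minimal sketch with hypothetical names; the actual inputs in the Shakespeare Web UI may differ.

```python
from dataclasses import dataclass, field

# Hypothetical shape of a copy-test setup; the class and field names are
# illustrative, not Shakespeare's actual data model.
@dataclass
class CopyTestSetup:
    test_name: str                  # the a/b test this copy test belongs to
    message_key: str                # pointer to the production copy in the message repo
    locale: str                     # language the variants are written in
    variants: list[str] = field(default_factory=list)  # copy variants to try

cta_test = CopyTestSetup(
    test_name="signup_cta_tone",
    message_key="signup.cta",
    locale="en-US",
    variants=["Get Started", "Try Now", "Join Free"],
)
```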
Establishing test copy with a/b test-cell mappings
The other significant question we needed to answer was how to most effectively handle the mappings between test cells and test copy. We opted to delegate that responsibility to a rules engine. Once the copy-test runner defines the copy variant for each a/b test-cell number, the Shakespeare Web UI saves the mapping information and passes it down to the rules engine for cloud data publishing.
Conceptually, each rule ties an a/b test and a production message key to the copy override for each test cell.
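Here is a minimal, hypothetical sketch of what such a rule might contain; the structure and field names are assumptions rather than Shakespeare’s real schema.

```python
# Hypothetical published rule: for one a/b test, each cell number maps to the
# copy that overrides the production string identified by message_key.
# The structure and field names are assumptions, not the real schema.
signup_cta_rule = {
    "test_id": "signup_cta_tone",
    "message_key": "signup.cta",   # production string being overridden
    "locale": "en-US",
    "cell_overrides": {
        1: None,                   # control cell keeps the production copy
        2: "Get Started",
        3: "Try Now",
        4: "Join Free",
    },
}
```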
How strings are fetched with Shakespeare
The Shakespeare API examines a/b allocation for the Netflix user and retrieves the correct copy for that user based on their cell allocation and rules mapping.
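In pseudocode terms, that lookup reduces to applying a per-cell override on top of the production string. A minimal sketch, assuming the hypothetical rule format shown earlier:

```python
# Assumed resolution logic at fetch time (a sketch, not the actual API): apply
# the copy override for the user's cell if one exists, otherwise fall back to
# the production string.
def resolve_copy(user_cell: int, rule: dict, production_copy: str) -> str:
    override = rule.get("cell_overrides", {}).get(user_cell)
    return override if override is not None else production_copy

rule = {"test_id": "signup_cta_tone", "cell_overrides": {2: "Get Started", 3: "Try Now"}}
print(resolve_copy(3, rule, "Sign Up"))  # -> "Try Now"
print(resolve_copy(1, rule, "Sign Up"))  # -> "Sign Up" (no override for this cell)
```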
At Netflix, we support four main platforms: TV, Web, Android and iOS. Each of these teams needed to integrate with the Shakespeare API, which took a couple of weeks of work apiece. (We’ve now completed the process for all but TV, which is in progress.)
The whole picture:
Shakespeare communicates with an a/b platform service for allocation, user test-cell lookup and result analysis. It also talks to a screenshot repo to fetch screenshots of where the test copy appears.
The Shakespeare web UI is implemented in React. Test metadata is stored in Cassandra, and Shakespeare interfaces with the database and storage repos through a REST API built on Spring Boot and Hibernate.
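As a rough illustration of that REST layer, a client or internal tool might read test metadata with a call like the one below; the host, path and response fields are assumptions, not the real API.

```python
import requests

# Hypothetical read of test metadata over the REST layer; the host, path and
# response fields are assumptions for illustration only.
resp = requests.get("https://shakespeare.internal.example/api/v1/copy-tests/signup_cta_tone")
resp.raise_for_status()
test = resp.json()
print(test["message_key"], test["cell_overrides"])
```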
Continuous Integration (CI) and Continuous Delivery/Deployment (CD)
The Shakespeare mapping rules are built, validated and deployed continuously, in real time.
In summary
To create Shakespeare, we took it step by step.
- The Shakespeare Web UI makes it easy to enter copy variants.
- A rules engine extracts test-cell copy mapping logic.
- A data-subscription service handles rules distribution.
- Our proprietary tool ABlaze allocates tests.
- Shakespeare performs real-time user test-cell lookup and copy override.
- Continuous Integration and Continuous Delivery/Deployment provide easy integration and real-time deployment.
So far, we’ve run around 50 Shakespeare tests — and we’re just getting started.
The Content Design Perspective
The lighter-weight copy testing made possible by Shakespeare provides more user insights into the copy we’re creating. Language-focused areas of testing that our Content Designers, Language Managers and others have explored or are planning to explore include:
- Word choice for microcopy. Even the smallest change can have a huge impact.
- Tone. Our voice attributes are Helpful, Warm, Playful, Relevant and Provocative. When should we lean into different tones? Shakespeare is helping us find out.
- Global relevance. Sometimes a language hypothesis created in Silicon Valley or L.A. doesn’t resonate in other parts of the world, and the copy feels more natural when it’s customized to the local market.
- UX best practices. By adding step numbers to the copy in our onboarding flow, we were able to increase the completion rate because people knew how many steps to expect.
- Style. We avoid all caps because they can feel shouty, but is there ever an exception to this rule?
- Clarity. Is there a simpler, more intuitive or more inclusive way to explain something?
- Context awareness. When the coronavirus pandemic began, we were able to quickly modify the text in our sign-up flow, since the “before” version felt tone-deaf in light of travel restrictions and more time spent at home.
Beyond the Metrics
Besides metric and UX wins, a bonus Shakespeare benefit is the way it’s brought together Engineers, Content Designers, Globalization experts, PMs, Data Scientists and other cross-functional partners in new and unexpected ways. As Netflix has grown its membership to 200 million global members and counting, it’s more important than ever to represent diverse perspectives in our product — including with the people who are building it.