ICAEA (the International Civil Aviation English Association) has produced Test Design Guidelines (the TDGs), which international and state regulators are now considering adopting as a standard mechanism for judging the suitability of tests in the AELP (plain English proficiency for aviation communications) context.

The ICAEA TDGs have led to the establishment of an ICAO working group to consider adapting them as regulation, and ICAEA has drafted a TDG Handbook which is being circulated among regulators and other industry insiders.

A standard mechanism is desirable, since the variability of current standards in test integrity, security, quality, and reliability is painfully obvious.

However, ICAEA makes some strong assertions in the TDGs – some of which appear to contradict ICAO Doc 9835 – and we have found no research data on its part to justify some of these claims.

The TDG Handbook offers rationales for operating as they propose, but these rationales already assume that the proposed changes are positive. They contain no evidence or data to support the changes, and we have been unable to locate any such evidence or data elsewhere.

It is therefore down to professional testers such as ourselves to highlight concerns more widely.

It can be argued that the TDGs contain fundamental flaws. Here's why we believe so, in brief:

1) Mandatory role play tasks

The TDGs demand that all tests contain Radiotelephony (RT) role-play tasks in which candidate and examiner simulate an RT interaction in standard phraseology that includes a need for plain English. There are, it seems to us, multiple problems with this demand:

1a) Interpreting poor performance

Imagine an otherwise strong, near-native-level speaker who performs poorly on such a task. How should assessors rate her performance? How do they know whether the problem is plain English proficiency, RT proficiency, or both?

The construct under test (i.e. AELP) may well suffer interference from this second variable (RT proficiency), which could lead to faulty assessments and greater unreliability generally.

Tasks would need very careful design to minimize this interference, and yet while the TDG Handbook (ICAEA, unpublished) notes the need to "avoid assessment of proficiency in phraseology", the TDGs offer no guidance on how to achieve this, nor even an acknowledgement that it might be a problem.

Lenguax Director Tyrone is British and speaks English natively. During his flight training he had to learn RT and admits that he struggled to gain spoken fluency: “radiotelephony has no native speakers, even I struggled with it when I first started, there was disfluency which led to the necessity to retransmit and ask for clarification and repetition. It is a skill which needs to be tested, and tested separately.”


1b) Scripting

Role-play tasks are highly scripted, so that both examiner and test-taker know what to say and when; eliciting unrehearsed, unscripted, authentically spontaneous speech or interaction is therefore extremely difficult. Simulating RT is situationally authentic; however, to perform the task, the test-taker must allow herself to be led by the examiner's script, leading to predictable responses in a heavily controlled task. This, clearly, is not authentic interaction.

1c) Elicitation of plain language 

In our experience, role-play tasks tend to elicit standard phraseology, or else language that is provided by the prompts used to conduct the role play. There is often very little original, plain English produced for assessment.

This issue is reflected in tasks on ICAEA's own RSS project. To verify what we say, choose one of the role-play tasks there and ask yourself what percentage of the interaction is in plain English.

Construct representation is an important concept in language testing that refers to the extent to which a language test accurately measures the underlying construct or ability it is intended to measure. Clearly, if role-play tasks under-represent plain language then that is a significant issue.

If ICAEA wishes to make the claim that role plays, by their nature, do indeed represent the plain language construct under test, then that is an empirical claim that we have not yet seen the evidence for.

1d) Standardisation

Role-play tasks are extremely difficult to standardise across candidate populations.

It is our view that this is a serious issue that would need addressing if role-play tasks (RPTs) were to become a mandatory part of language testing.

How are test providers to ensure that their RPTs for different candidate populations are of equivalent challenge or variability? How are other stakeholders able to judge this? 

Again, we have seen no guidance on this, nor any acknowledgement that it may be an issue.

If left unaddressed, stakeholders could not be sure that test-takers are receiving equally challenging tasks, leading to feelings of unfairness and reducing confidence in the outcomes of testing generally.

1e) Practicality

Role-play tasks are difficult to perform meaningfully.

For example, test-takers must be provided with the plain English information crucial for task completion. Presenting this information so that test-takers can perform the task without simply being given ‘the answers’ is a significant practical challenge for test designers – and many testing companies will find it impossible to do well.

Once again, if we are to enforce the use of RPTs, is it not important to address this shortcoming with the task type?


2) “Authenticity” as an end in itself

It is well established in the language testing literature that validity and reliability are fundamental tenets of good test design. These are defined terms whose meaning is understood by everyone in the field.

Test designers are obliged to provide coherent explanations of their tests' reliability and validity, using the accepted definitions of those terms.

ICAO, in Document 9835, cites the ALTE Principles of Good Practice, which promote reliability and validity, alongside ‘practicality’ and ‘impact’, as central pillars of good language tests.

However, ICAEA's (unpublished) TDG Handbook additionally claims that "authenticity" is "one of the fundamental elements in specific purpose language testing", but offers no justification for this claim, nor any mechanism by which stakeholders may assess a test's "authenticity" or lack thereof.

How is "authenticity" being defined here? How are we to determine whether a test has it or not? Why isn't it properly considered an aspect of validity, and instead being set up as a goal in itself?

Indeed, there is a deep potential confusion here between what one might call "situational authenticity" (talking about the kinds of things one might really talk about over the radio) and "interactional authenticity" (talking to an interlocutor in a similar way to the way one might do it over the radio). These are by no means the same. 


One language testing peer communicated to us the following about the TDGs:

"as well as flying in the face of guidance provided in ICAO Doc 9835, Chapter 6, the attempted close/narrow replication of RT communications possibly confuses situational authenticity with interactive authenticity. Certainly, simulating RT is situationally authentic; however, in order to perform in the task, the test-taker must allow herself to be ‘led’ by the interlocutor via the interlocutor script, leading to predictable responses in a heavily controlled task. Where you take away the independence of the test-taker in managing the interaction in terms of turn-taking, reciprocity and content, it would be difficult to argue that this has interactive authenticity – or even genuine interactiveness."

Test designers must consider how practical constraints are to be overcome or balanced when designing tasks to meet their own test task guidelines.

ICAEA chooses to overlook ‘practicality’ completely, and promotes (without explanation) the concept of ‘authenticity’ – which, to our knowledge, is not well defined in the literature – to support its main argument for role-play tasks.

Our fear is that insistence upon the inclusion of such tasks will lead to the production of "box-ticking" role-play tasks that do little to differentiate between candidates and, in practice, provide little meaningful "authenticity" in terms of their interactivity.

3) Listening tasks

The TDGs mandate separate listening tasks. This is, in our view, over-prescriptive and – again – impractical in many ways.

Whenever language testers try to measure listening comprehension, there are confounding factors which must be considered.

In a listening test in which candidates must read questions and then write or type answers, reading, writing, or even spelling skills might contaminate the assessment of listening comprehension.

On the other hand, when oral responses are required to demonstrate comprehension, a candidate’s (in)ability to speak accurately or coherently may impede the assessment of listening comprehension.

This is the reality that all test designers and researchers must contend with so as to find the best balance and justify their approach.

Before we can support this rule, we await direct evidence that assessing listening separately produces a truer assessment of candidates' comprehension.

In conclusion

In addition to our academic credentials, the directors at Lenguax have accumulated many years of practical experience in this field.

We do not position ourselves as unchallengeable authorities on these subjects: we are always trying to collaborate, learn and improve. Nor are we wilfully ignoring the ideas behind the TDGs. If necessary, we would be able to adapt our approach to meet them.  

However, we strongly believe our service would be no more valid or reliable if we did so. And we are convinced that while AELP testing is in a dire state and does require more consistent and thorough regulation, the TDGs will do nothing to improve the standard of tests. In fact, we suspect they may make matters worse: imposing arbitrary mandates without thorough consideration of the drawbacks inherent in the enforced approach may lead to box-ticking activities that do nothing to improve the current situation.

Such problems are not helped by the use of terms such as "authenticity" which lack (to our knowledge) rigorous academic definition or a meaningful way of establishing their presence or absence in a test.

We have made our views known to senior members of the ICAEA Board.

What could be done instead of the TDGs?

It is fair to say that regulators need support in approving and overseeing output from AELP test service providers. All providers can be accused of self-interest – that’s the nature of the beast when you run a commercial company in a competitive industry – but open discussion and debate of evidence-based proposals can advance regulation of the industry.

In the next post, we will consider some possible avenues of research that might clear up some of the confusion we have identified above, as well as some questions which might be considered at regulatory level to improve AELP test outcomes generally.


If you would like to comment on this article, please do so on our LinkedIn post at this link.
