In our previous post we expressed our belief that certain claims made in ICAEA's Test Design Guidelines regarding specific task types lack sufficient empirical evidence. Additionally, we highlighted some drawbacks associated with the implementation of these task types, which appear to have been overlooked.

We maintain that relying solely on personal opinions is an inadequate justification for imposing task types on the testing community, particularly when such task types introduce new challenges for test designers that have been insufficiently explored.

The original post raised several issues that we would like to address by seeking empirical grounding:

1) Construct Representation in Role Play Tasks

According to our understanding, ICAEA asserts that Role Play Tasks (RPTs) can adequately represent the construct of plain English in aviation, while other tasks cannot.

However, we argue that during most RPTs, candidates produce minimal plain language, and when they do, it is often prompted by the task itself.

It is worth noting that the quality of the RPT can be a confounding variable, which the Test Design Guidelines do not specifically address, potentially undermining their objective of improving the quality of aviation English testing.

These conflicting claims present a straightforward dispute that could be resolved empirically.

For example, we propose randomly selecting a sample of candidate responses to a high-quality RPT, providing the written prompts used during the task, and having independent assessors with sufficient knowledge of plain English and RPTs evaluate the proportion of plain English produced spontaneously by the candidate.
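The headline figure from such an evaluation is a proportion estimated from a finite sample of utterances, so it should be reported with a margin of uncertainty. As a minimal sketch, in Python and with invented numbers purely for illustration, a Wilson score interval for the proportion of spontaneously produced plain English might look like:

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
            / (1 + z * z / n))
    return centre - half, centre + half

# Invented example: assessors judge 18 of 60 candidate utterances
# to be spontaneously produced plain English.
lo, hi = wilson_interval(18, 60)
print(round(lo, 3), round(hi, 3))  # roughly 0.199 to 0.425
```

With only 60 utterances the interval is wide, which itself illustrates the problem: an RPT that elicits little plain language cannot support a precise estimate of a candidate's plain English ability.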

We invite proponents of such tasks to present a task and rubric that could be collaboratively developed to facilitate this procedure. Analysing such tasks could lead us to understand what kind of RPT design is needed to elicit sufficient plain English for an accurate plain English assessment.

If anyone agrees to help us with this task, we would undertake to document and publish the process with full transparency.

2) Task Equivalence

Another concern we raised is the potential disparity in task difficulty when designing RPTs for different candidate populations.

At Lenguax, we have established documented procedures to help create task equivalence across various dimensions in item writing, such as language complexity, prompt length, coverage of different language functions, and lexical domains.
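Some of these dimensions can be checked mechanically. The sketch below, in Python with invented prompts (`prompt_stats` is an illustrative name, not an actual Lenguax tool), compares two prompt sets on two crude proxies: mean prompt length and lexical variety (type-token ratio):

```python
def prompt_stats(prompts):
    """Crude equivalence metrics for a set of item-writing prompts:
    average length in words, and type-token ratio as a proxy for
    lexical variety."""
    words = [w.strip(".,?").lower() for p in prompts for w in p.split()]
    return {
        "mean_length": len(words) / len(prompts),
        "ttr": len(set(words)) / len(words),
    }

# Invented prompt sets for two candidate populations.
pilot_prompts = ["Report your position and your intentions.",
                 "Report the weather on departure."]
atc_prompts = ["Say again your callsign.",
               "Confirm the traffic is in sight."]

print(prompt_stats(pilot_prompts))  # mean_length 5.5, ttr ~0.82
print(prompt_stats(atc_prompts))    # mean_length 5.0, ttr 1.0
```

Metrics like these can flag gross imbalances between versions, but they say nothing about functional coverage or perceived difficulty, which is precisely why the harder equivalence questions below remain open.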

However, achieving task equivalence in RPTs seems to pose a formidable challenge for several reasons. How can we ensure that a test for Air Traffic Controllers (ATCs) is not perceived as being (or actually is) easier than one for pilots? Or that a test for student pilots is not easier than one for commercial pilots?

If it is deemed necessary for all test providers to produce RPTs for all roles being tested, it becomes crucial to take measures to ensure task equivalence.

If work has already been done to address the considerations required for producing equivalent RPTs across aviation roles, we would be willing to host and share such information for the benefit of test providers in general, to promote fairness. We acknowledge that we may simply be unaware of existing work in this area; if so, the same is likely true of most other test providers.

If no such work has been done, though, we are open to collaborating with others to develop equitable ways of designing such tasks.

3) Listening Comprehension May Not Be Assessed Orally

The proposed mandate assumes that assessing comprehension using oral feedback from the candidate introduces a confounding variable that results in Comprehension scores that are either unfairly high or unfairly low.

It is not clear which claim is being made.

Additionally, the assumption implies that confounding variables associated with assessing comprehension using other modes (e.g., written summaries, multiple choice) are significantly less relevant.

Similarly to the issue above about plain language elicitation, this seems to be an empirical claim for which we have been unable to find existing evidence. We are willing to collaborate with others in an open and transparent manner to produce such evidence.

To begin, it is essential to establish the validity of this assumption in general. Does assessing comprehension through oral feedback in fact produce different ICAO ratings from the methods which ICAEA seeks to impose on test providers?

TEAC employs oral feedback as a proxy to infer a candidate's comprehension, providing a benchmark for this aspect of language ability. One potential approach to investigate this matter further would involve having trial candidates take a TEAC test and then perform a selected comprehension test using a different instrument, of the sort proposed by ICAEA.

If significant differences in the ratings awarded are discovered, it would lend credibility to ICAEA's viewpoint. Although this would require substantial effort in finding a diverse pool of candidates willing to undertake both listening tasks, developing a transparent method for creating the listening tasks that satisfies proponents of both views, and ensuring impartial and transparent rating and data analysis, it is not an impossible endeavour in principle.
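The "significant differences" question admits a very simple first pass. As a sketch under obvious assumptions (Python, invented paired ratings, one ICAO level per candidate per instrument), an exact sign test asks whether one instrument systematically awards higher levels than the other:

```python
from math import comb

def sign_test(pairs):
    """Two-sided exact sign test on paired ratings; tied pairs are dropped."""
    pos = sum(1 for a, b in pairs if a > b)
    neg = sum(1 for a, b in pairs if a < b)
    n = pos + neg
    if n == 0:
        return 1.0  # every pair tied: no evidence of a difference
    k = max(pos, neg)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented data: ICAO levels from an oral-feedback test and from an
# alternative listening instrument, for the same twelve candidates.
oral = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5, 4, 4]
alt  = [4, 4, 4, 3, 3, 5, 3, 4, 3, 4, 4, 3]
print(sign_test(list(zip(oral, alt))))  # 0.0625: suggestive, not conclusive
```

A more serious analysis would also examine agreement (for example, a weighted kappa) and, crucially, how often the two instruments place the same candidate on opposite sides of the Level 4 threshold; but even this toy example shows that the question is answerable with data rather than opinion.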

It would also be necessary to develop a robust task rubric that is fair to all parties involved.

For our part, we are prepared to contribute our resources, but we would need to collaborate with others who possess a listening instrument approved by ICAEA to provide evidence on this matter.


In conclusion, our previous post expressed concerns regarding the lack of empirical support for certain assumptions made in ICAEA's Test Design Guidelines. We highlighted the need for evidence-based decision-making and collaboration to address important issues such as construct representation in role play tasks, task equivalence across candidate populations, and the assessment of listening comprehension. We remain open to constructive discussions and collaborations to ensure fairness and improve the quality of aviation English testing.

As before, we invite anyone wishing to comment publicly to do so on our LinkedIn page. If you wish to contact us privately you can do so at this email address.
