The Famous New York Taxis Example
Everyone with an interest in data privacy knows the New York Taxis story, which dates back to 2014. It is a great story that illustrates the difficulties we face when we want to share or make use of data without compromising the privacy of the people represented in it. In this particular case the organisation had no choice: the “anonymised” data was made available following a citizen’s Freedom of Information Act (FOIA) request. Some textbook mistakes made the release world famous within days, and they beautifully illustrate why data privacy is a lot harder than it might first appear. Simply removing Personal Information (PI) and masking key identifiers in a dataset will put you on the right side of the law in most countries (the European Union being a notable exception), but it is not nearly sufficient to properly protect the privacy of individuals.
The weakest link in this story was the masking applied to the key identifiers (taxi medallions and license numbers). These were obscured with an unsalted MD5 hash rather than strong encryption, and because the set of valid identifiers is small, the hashes could be reversed by simply trying every possibility. Once the identifiers were recovered, the data was open to all manner of linking and inference attacks using publicly available data or other knowledge. One such attack used paparazzi photos available on the internet: link a picture of someone getting into an identified cab at a specific time and place with the matching trip record, and you can discover where they went and how much they paid and tipped! A few celebrities and local dignitaries were “unmasked” in this way.
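To see why masking identifiers with a deterministic function is fragile when the set of valid identifiers is small, consider the minimal sketch below. It assumes, as was widely reported for this dataset, that the identifiers were obscured with an unsalted MD5 hash. The medallion format used here (digit, letter, digit, digit) is a simplified assumption for illustration, and the sample value "7B42" is made up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(value: str) -> str:
    """Return the unsalted MD5 hex digest of an identifier."""
    return hashlib.md5(value.encode("ascii")).hexdigest()


def build_lookup() -> dict:
    """Hash every candidate of the assumed form digit-letter-digit-digit.

    With only 10 * 26 * 10 * 10 = 26,000 candidates, the whole space
    can be enumerated almost instantly.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(candidate)] = candidate
    return table


lookup = build_lookup()

# What a "masked" record would contain for the made-up medallion 7B42.
masked = md5_hex("7B42")

# Reversing the mask is a single dictionary lookup.
recovered = lookup[masked]
print(len(lookup), recovered)
```

The point is not the choice of hash function: any unkeyed, deterministic transformation over such a small identifier space can be reversed in the same way.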
One key reason the New York Taxis story, and countless other stories since, played out this way is that the publisher really had no way of understanding the risks before applying redactions and releasing the dataset. The other, of course, is the scarcity of good data privacy tools and techniques that are usable by the relatively uninitiated. Most organisations don’t have the skills and tools to do privacy well, or even to recognise what that entails.
The Impact in Education Technology
It is reported that global spend on Education Technology (EdTech) will reach $250bn by the end of 2020 (EdTechXGlobal). It is a fast-growing sector, with students all over the world increasingly conducting at least some of their learning on purpose-built online learning platforms. Students learning online include school (K-12) and university students, as well as people in a wide variety of adult education settings, from traineeships and apprenticeships to career development and re-training.
As the number of students using EdTech increases, so does the pool of data which the students leave on these online platforms. This data is increasingly seen as the key to better understanding how students learn. The relatively new field of Learning Analytics (LA) has grown from a number of disciplines, including learning theory and pedagogy, cognitive science, psychology, and artificial intelligence and machine learning. Its purpose is to harvest the information from online EdTech platforms to gain insights and generate models of our students and their learning processes. Lines of enquiry are extremely broad and include, for example, modelling skills development and competencies, understanding student motivations and engagement, and mining student activities and attitudes to develop and test theoretical models of learning. The promise of LA is a significant increase in the quality and personalisation, and therefore the effectiveness and reach, of online learning. Much of the information gathered by online EdTech platforms and used in Learning Analytics includes Personal Information (by the standard of the Australian Privacy Act 1988). It can also include information, such as assessments and associated comments or grades, which could become sensitive to the individual at some future point in time (when applying for jobs or public positions, for example).
Even for the skilled and initiated, current approaches to data privacy in EdTech are inadequate and ad hoc. Privacy preservation approaches, loosely termed “data perturbation”, lose valuable information, reducing the data’s utility for Learning Analytics. Worse still, data treated in this way can often still be re-identified using common linking and inference attacks like those described above. These shortcomings severely limit the ability of educators and technologists to safely use the data in their possession to innovate and improve.
What if there were a way to consistently measure privacy risk across any data? A way to analyse student data and come up with one or two universal measures of the risk of personal re-identification? A consistent measure of privacy risk would allow us to develop standards or measures of “fitness” for our data, based on our risk appetites, along with consistent, policy-based approaches to data management and sharing. It would allow us to establish data sharing agreements in a similar way. Most importantly, it would allow us to measure how effective our privacy risk reduction mechanisms really are before releasing data to the public or distributing it within our organisations for analytics and reporting.
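As a rough illustration of what such a measure could look like, the sketch below scores a dataset using equivalence classes, the idea behind k-anonymity: records that share the same combination of quasi-identifiers are indistinguishable from one another, so a record's risk is taken as one over the size of its class. This is one well-known family of measures, not necessarily the one this project will adopt, and the field names and student records are invented for the example.

```python
from collections import Counter


def reidentification_risk(records, quasi_identifiers):
    """Summarise re-identification risk from equivalence class sizes."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    # A record hidden among n look-alikes has a 1/n chance of being singled out.
    per_record = [1 / class_sizes[k] for k in keys]
    unique = sum(1 for k in keys if class_sizes[k] == 1)
    return {
        "k": min(class_sizes.values()),          # smallest class size
        "max_risk": max(per_record),             # worst-case record
        "mean_risk": sum(per_record) / len(per_record),
        "unique_fraction": unique / len(keys),   # share of one-of-a-kind records
    }


# Hypothetical student records; "grade" is the sensitive value.
records = [
    {"postcode": "2000", "birth_year": 1999, "course": "Maths", "grade": 71},
    {"postcode": "2000", "birth_year": 1999, "course": "Maths", "grade": 58},
    {"postcode": "2000", "birth_year": 1999, "course": "Maths", "grade": 90},
    {"postcode": "2010", "birth_year": 2001, "course": "Physics", "grade": 64},
    {"postcode": "2010", "birth_year": 1987, "course": "Physics", "grade": 82},
]

risk = reidentification_risk(records, ["postcode", "birth_year", "course"])
print(risk)
```

Here two of the five students are unique on postcode, birth year and course, so the worst-case risk is 1.0 even though no names appear in the data. A policy could then be phrased in terms of these numbers, for example requiring a minimum class size before release.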
What if there were a set of privacy risk reduction tools? Tools which abstracted away the very technical and specialised techniques required for privacy risk reduction today, which offered a range of risk-reduction options depending on your specific needs and risk appetite, and which helped strike the right balance between measurable (and provable) privacy risk reduction and adequate utility of the data for analytics purposes. These tools, in conjunction with a consistent measure of privacy risk and risk reduction, would help lift the capability of the Education and EdTech industry in data privacy risk management and enable a “privacy first” approach to Learning Analytics projects. In so doing, they would allow us to share and make greater use of learning data, and to innovate and improve online learning, whilst measurably protecting the privacy of our learners, teachers and mentors.
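To make the trade-off between risk reduction and utility concrete, here is a minimal sketch of one simple risk-reduction technique, generalisation, in which quasi-identifiers are coarsened so that more records look alike. The choice of technique, the field names and the records are assumptions made for illustration; real tools would offer more options and stronger guarantees.

```python
from collections import Counter


def smallest_class(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


def generalise(record):
    """Coarsen quasi-identifiers: birth year to decade, postcode to a two-digit prefix."""
    out = dict(record)
    out["birth_year"] = f"{record['birth_year'] // 10 * 10}s"
    out["postcode"] = record["postcode"][:2] + "**"
    return out


# Hypothetical student records; "grade" is left untouched for analytics.
records = [
    {"postcode": "2000", "birth_year": 1998, "grade": 71},
    {"postcode": "2007", "birth_year": 1999, "grade": 58},
    {"postcode": "2010", "birth_year": 1991, "grade": 90},
    {"postcode": "2113", "birth_year": 2001, "grade": 64},
    {"postcode": "2122", "birth_year": 2003, "grade": 82},
]
QUASI_IDENTIFIERS = ["postcode", "birth_year"]

k_before = smallest_class(records, QUASI_IDENTIFIERS)
coarse = [generalise(r) for r in records]
k_after = smallest_class(coarse, QUASI_IDENTIFIERS)
print(k_before, k_after)
```

Before generalisation every student is unique on postcode and birth year; afterwards each shares those values with at least one other student. The cost is utility: an analysis that needs exact birth years or postcodes is no longer possible, which is exactly the balance a tool would need to help its users strike.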
The good news is that the global research community has made significant advances in the development of effective and provable privacy risk measurement and reduction. This project pulls together a multi-disciplinary team of researchers and education industry leaders to build on these technologies and adapt them to Education and Learning Analytics. We are a passionate group with a shared vision: to build a data privacy risk management platform and enable the education industry to grow and meet the needs of our students and educators in the 21st century.