Mar 23 2018
 

Today I am back at the Data Fest Data Summit 2018, for the second day. I’m here with my EDINA colleagues James Reid and Adam Rusbridge and we are keen to meet people interested in working with us, so do say hello if you are here too! 

I’m liveblogging the presentations so do keep an eye here for my notes, updated throughout the event. As usual these are genuinely live notes, so please let me know if you have any questions, comments, updates, additions or corrections and I’ll update them accordingly. 

Intro to Data Summit Day 2 – Maggie Philbin

We’ve just opened with a video on Ecometrica and their Data Lab supported work on calculating water footprints. 

I’d like to start by thanking our sponsors, who make this possible. And also I wanted to ask you about your highlights from yesterday. These include Eddie Copeland from Nesta’s talk, discussion of small data, etc. 

Data Science for Societal Good — Who? What? Why? How? –  Kirk Borne, Principal Data Scientist and Executive Advisor, Booz Allen Hamilton

Data science has a huge impact for the business world, but also for societal good. I wanted to talk about the 5 i’s of data science for social good:

  1. Interest
  2. Insight
  3. Inspiration
  4. Innovation
  5. Ignition

So, number one is Interest. The data can attract people to engage with a problem. Everything we do is digital now. And all this information is useful for something. No matter what your passion, you can follow it as a data scientist. I wanted to give an example here… My background is astrophysics and I love teaching people about the world, but my day job has always been other things. About 20 years ago I was working in data science at NASA and we saw an astronomical – and I mean it, we were NASA – growth in data. And we weren’t sure what to do with it, and a colleague told me about data mining. It seemed interesting but I just wasn’t getting what the deal was. We had a lunch talk from a professor at Stanford, and she came in and filled the board with equations… She was talking about the work they were doing at IBM in New York. And then she said “and now I’m going to tell you about our summer school” – where they take inner city kids who aren’t interested in school, and teach them data science. Deafening silence from the audience… And she said “yes, we teach them data mining in the context of what means most for these students, what matters most”. And she explained: street basketball. So IBM was working on software called IBM Advanced Calc specifically for predicting basketball strategy. And the kids loved basketball enough that they really wanted to work on the math and science… And I loved that, but what she said next changed my life.

My PhD research was on colliding galaxies. It was so exciting… I loved teaching and I was so impressed with what she had done. These kids she was working with had peer pressure not to be academic, not to study. This school had a graduation rate of less than 50%. Their mark of success was their students’ graduation rate – of 98%. I was moved by that. I felt that if data science has this much power to change lives, that’s what I want to do for the rest of my life. So my life, and those of my peers, has been driven by passion. My career has been as much about promoting data literacy as anything else.

So, secondly, we have insight. Traditionally we collect some data points but we don’t share this data, we are not combining the signals… Insight comes from integrating all the different signals in the system. That’s another reason for applying data to societal good, to gain understanding. For example, at NASA, we looked at what could be combined to understand environmental science, and all the many applications, services and knowledge that could be delivered and drive insight from the data.

Number three on this list is Inspiration. Inspiration, passion, purpose, curiosity, these motivate people. Hackathons, when they are good, are all about that. When I was teaching, the group projects where the team members were all the same did the worst and least interesting work. When the team is diverse in the widest sense – people who know nothing about Python, R, etc. can bring real insights. So, for example my company runs the “Data Science Bowl” and we tackle topics like ocean health, heart health, lung cancer, drug discovery. There are prizes for the top ten teams, and this year there is a huge computing prize as well as a cash prize. The winners of our heart health challenge were two Wall Street quants – they knew math! Get involved!

Next, innovation. Discovering new solutions and new questions. Generating new questions is hugely exciting. Think about the art of the possible. The XYZ of Data Science Innovation is about precision data, precision for personalised medicine, etc.

And fifth, ignition. Be the spark. My career came out of looking through a telescope back when I lived in Yorkshire as a kid. My career has changed, but I’ve always been a scientist. That spark can create change, can change the world. And big data, IoT and data scientists are partners in sustainability. How can we use these approaches to address the 17 Sustainable Development Goals? And there are 229 Key Performance Indicators to measure performance – get involved. We can do this!

So, those are the five i’s. And I’d like to encapsulate this with the words of a poet…. Data scientists – and that’s you even if you don’t think you are one yet. You come out of the womb asking questions of the world. Humans do this, we are curious creatures… That’s why we have that data in the first place! We naturally do this!

“If you want to build a ship, don’t drum up people to gather wood and don’t assign them tasks and work, but rather teach them to yearn for the vast and endless sea”

– Antoine de Saint-Exupery.

This is what happened with those kids. Teach people to yearn for the vast and endless sea, and then you’ll get the work done. Then we’ll do the hard work.

Slides are available here: http://www.kirkborne.net/DataFest2018/

Q&A

Comment, Maggie Philbin) I run an organisation, Teen Tech, and that point you are making, to start where the passion actually is, is so important.

KB) People ask me about starting in data science, and I tell them that you need to think about your life, what you are passionate about and what will fuel and drive you for the rest of your life. And that is the most important thing.

Q1) You touched on a number of projects, which is most exciting?

A1) That’s really hard, but I think the Data Bowl is the most exciting thing. A few years back we had a challenge looking at how fast you can measure “ejection fraction” – how fast the heart pumps blood out – because the way that is done, by specialists, could take weeks. Now that analysis is built into the MRI process and you can instantly re-scan if needed. Now I’m an astronomer but I get invited to weird places… And I was speaking to a conference of cardiac specialists. A few weeks before, my doctor had diagnosed me with a heart issue… and said it would take a month to know for sure. I got a text giving me the all clear just before I was about to give that talk. I just leapt onto that stage to give that presentation.

The Art Of The Practical: Making AI Real – Iain Brown, Lead Data Scientist, SAS

I want to talk about AI and how it can actually be useful – because it’s not the answer to everything. I work at SAS, and I’m also a lecturer at Southampton University, and in both roles look at how we can use machine learning, deep learning, AI in practical useful ways.

We have the potential for using AI tools for good, to improve our lives – many of us will have an Alexa for instance – but we have to feel comfortable sharing our data. We have smart machines. We have AI revolutionising how we interact with society. We have a new landscape which isn’t about one new system, but a whole network of systems to solve problems. Data is a sellable asset – there is a massive competitive advantage in storing data about customers. But especially with GDPR, how is our data going to be shared with organisations, and others? That matters for individuals, but also for organisations. As data scientists there is the “can” – how can the data be used; and the “should” – how should the data be used. We need to understand the reasons and value of using data, and how we might do that.

I’m going to talk about some examples here, but I wanted to give an overview too. We’ve had neural networks for some time – AI isn’t new but dates back to the 1950s. Machine learning came in in the 1980s, deep learning in the 2010s, and cognitive computing now. We’ve also had Moore’s Law changing what is theoretically possible but also what is practically feasible over that time. And that brings us to a definition: “Artificial Intelligence is the science of training systems to emulate human tasks through learning and automation”. That’s my definition, you may have your own. But it’s about generating understanding from data, that’s how AI makes a difference. And it has to help the decision making process. That has to be something we can utilise.

Automation of process through AI is about listening and sensing, about understanding – that can be machine generated but it will have human involvement – and that leads to an action being taken. For instance we are all familiar with taking a picture, and that can be looked at and understood. For instance with a bank you might take an image of paperwork and passports… Some large banks check the validity of clients with a big book of pictures of blacklisted people… Wouldn’t it be better to use systems to achieve that? Or it could be a loan application or contract – they use application scorecards. The issue here is interpretability – if we make decisions we need to know why, and the process has to be transparent so the client understands why they might have been rejected. You also see this in retail… Everything is about the segment of one. We all want to be treated as individuals… How does that work when you are one of millions of individuals? What is the next thing you want? What is the next thing you want to click on? Shop Direct, for instance, have huge ranges of products on their website. They have probably 500 pairs of jeans… Wouldn’t it be better to apply their knowledge of me to filter and tailor what I see? Another example is the customer complaint on webchat. You want to understand what has gone wrong. And you want to intervene – you may even want to do that before they complain at all. And then you can offer an apology.

There are lots of applications for AI across the board. So we are supporting our customers on the factors that will make them successful in AI: data, compute, skillset. And we embed AI in our own solutions, making them more effective and enhancing user experience. Doing that allows you to begin to predict what else might be looked at, based on what you are already seeing. We also provide our customers with extensible capabilities to help them meet their own AI goals. You’ll be aware of AlphaGo – it only works for one game, and that’s a key thing… AI has to be tailored to specific problems and questions.

For instance we are working on a system looking at optimising the experience of watching sports, eliminating the manual process of tagging in a game. This isn’t just in sport, we are also working in medicine and in lung cancer, applying AI in similar 3D imaging ways. When these images can be shared across organisations, you can start to drive insights and anomalies. It’s about collaborating, bringing data from different areas, places where an issue may exist. And that has social benefit of all of us. Another fun example – with something like wargaming you can understand the gamer, the improvements in gameplay, ways to improve the mechanics of how game play actually works. It has to be an intrinsic and extrinsic agreement to use that data to make that improvement.

If you look at a car insurer and the process and stream of that, that’s typically through a call centre. But what if you take a picture of the car as a way to quickly assess whether that claim will be worth making, and how best to handle that claim.

I value the application, the ways to bring AI into real life. How we make our experiences better. It’s been attributed to Voltaire, and also to Spiderman, that “with great power comes great responsibility”. I’d say “with great data power comes great responsibility” and that we should focus on the “should” not the “could”.

Q&A

Comment) A correction on AlphaGo: AlphaZero plays chess etc., without any further human interaction or change.

Q1) There is this massive opportunity for collaboration in Scotland. What would SAS like to see happen, and how would you like to see people working together?

A1) I think collaboration through industry, alongside academia. Kirk made some great points about not focusing on the same perspectives but on the real needs and interest. Work can be siloed but we do need to collaborate. Hack events are great for that, and that’s where the true innovation can come from.

Q2) What about this conference in 5 years time?

A2) That’s a huge question. All sorts of things may happen, but that’s the excitement of data science.

Socially Minded Data Science And The Importance Of Public Benefits – Mhairi Aitken, Research Fellow, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh

I have been working in data science and public engagement around data and data science for about eight years and things have changed enormously in that time. People used to think about data as something very far from their everyday lives. But things have really changed, and people are aware and interested in data in their lives. And now when I hold public events around data, people are keen to come and they mention data before I do. They think about the data on their phones, the data they share, supermarket loyalty cards. These may sound trivial but I think they are really important. In my work I see how these changes are making real differences, and differences in expectations of data use – that it should be used ethically and appropriately but also that it will be used.

Public engagement with data and data science has always been important but it’s now much easier to do. And there is much more interest from funders for public engagement. That is partly reflecting the press coverage and public response to previous data projects, particularly NHS data work with the private sector. Public engagement helps address concerns and avoid negative coverage, and to understand their preferences. But we can be even more positive with our public engagement, using it to properly understand how people feel about their data and how it is used.

In 2016 myself and colleagues undertook a systematic review of public responses to sharing and linking of health data for research purposes (Aitken, M et al 2016 in BMC Medical Ethics, 17 (1)). That work found that people need to understand how data will be used, and they particularly need to understand that there will be public benefit from their data. In addition to safeguards, secure handling, and a sense of control, they still have to be confident that their data will be used for public benefit. People can even be supportive when the benefit is clear but those other factors are weaker. Trust is core to this. It is fundamental to think about how we earn public trust, and what trust in data science means.

Public trust is easy to define. But what about “public benefit”? When people think of benefit from data they will often talk about things like the Tesco Clubcard – there is a direct tangible benefit there in the form of vouchers. But what is the public benefit in a broader and less direct sense? When we ask about public benefit in the data science community we often talk about economic benefits to society through creating new data-driven innovation. But that’s not what the public think about. For the public it can be things like improvements to public services. In data-intensive health research there is an expectation of data leading to new cures or treatments. Or that there might be feedback to individuals about their own conditions or lifestyles. But there may be undefined or unpredictable potential benefits to the public – it’s important not to define the benefits too narrowly, but still to recognise that there will be some.

But who is the “public” that should benefit from data science? Is that everyone? Is it local? National? Global? It may be as many as possible, but what is possible and practical? Everyone whose data is used? That may not be possible. Perhaps vulnerable or disadvantaged groups? Is it a small benefit for many, or a large benefit for a small group? Those who may benefit most? Those who may benefit the least? The answers will be different for different data science projects. That will vary for different members of the public. But if we only have these conversations within the data science community we’ll only see certain answers, we won’t hear from groups without a voice. We need to engage the public more with our data science projects.

So, closing thoughts… We need to maintain a social license for data science practices and that means continual reflection on the conditions for public support. Trust is fundamental – we don’t need to make the public trust us, we have to actually be trustworthy, and that means listening, understanding and responding to concerns, and being trustworthy in our use of data. Key to this is finding the public benefits of data science projects. In particular we need to think about who benefits from data science and how benefits can be maximised across society. Data scientists are good at answering questions of what can be done, but we need to be focusing on what should be done and what is beneficial to do.

Q&A

Q1) How does private industry make sure we don’t leave people behind?

A1) Be really proactive about engaging people, rather than waiting for an issue to occur. Finding ways to get people interested. Making it clear what the benefits are to people’s lives. There can be cautiousness about opening up debate being a way to open up risk. But actually we have to have those conversations and open up the debate, and learn from that.

Q2) How do we put in enough safeguards that people understand what they consent to, without giving them too much information or scaring them off with 70 checkboxes?

A2) It is a really interesting question of consent. Public engagement can help us understand that, and guide us around how people want to consent, and what they want to know. We are trying to answer questions where we don’t always have the answers – we have to understand what people need by asking them and engaging them.

Q3) Many in the data community are keen to crack on but feel inhibited. How do we take the work you are doing and move sooner rather than later?

A3) It is about how we design data science projects. You do need to take the time first to engage with the public. It’s very practical and valuable to do at the beginning, rather than waiting until we are further down the line…

Q3) I would agree with that… We need to do that sooner rather than later rather than being delayed deciding what to do.

Q4) You talked about concerns and preferences – what are key concerns?

A4) Things you would expect on confidentiality, privacy, how they are informed. But also what the outcome of the project is – is it beneficial, or could it be discriminatory, or have a negative impact on society? It comes back to demonstrating public benefits – people want to see the outcomes and impact of a piece of work.

 

Automated Machine learning Using H2O’s Driverless AI – Marios Michailidis, Research Data Scientist, H2O.ai

I wanted to start with some of my own background. And I wanted to talk a bit about Kaggle. It is the world’s biggest predictive modelling competition platform with more than a million members. Companies host data challenges and competitors from across the world compete to solve them for prizes. Prizes can be monetary, or participation in conferences, or you might be hired by companies. And it’s a bit like tennis – you gain points and go up in the rankings. And I was able to be ranked #1 out of half a million members there.

So, a typical problem is image classification. Can I tell a cat from a dog from an image? That’s very doable – you can get over 95% accuracy, and you can do that with deep learning and neural nets, differentiating and classifying features to enable that decision. Similarly a typical problem may be classifying different bird songs from a sound recording – also very solvable. You also see a lot of text classification problems… And you can identify texts from a particular writer by their style and vocabulary (e.g. Voltaire vs Moliere). And you see sentiment analysis problems – particularly for marketing or social media use.

To win these competitions you need to understand the problem, and the metric you are being tested on. For instance there was an insurance problem where most customers were renewing, so there was more value in splitting the problem in two – one model for renewals, and another for the rest. You have to have a solid testing procedure – a really strong validation environment that reflects what you are being tested on. So if you are being tested on predictions for 3 months in the future, you need to validate on held-out past data in the same way, so you have the confidence that what you do will be appropriately generalisable.
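
To make that point concrete, here is a minimal sketch of a time-based validation split in Python – my own illustration with synthetic data, not Marios’s actual code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for competition data: two years of daily sales.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2016-01-01", periods=730, freq="D"),
    "price": rng.uniform(5, 15, 730),
    "promo": rng.integers(0, 2, 730),
})
df["sales"] = 100 - 3 * df["price"] + 20 * df["promo"] + rng.normal(0, 5, 730)

# Validate the way you will be tested: train on the past, score on the
# most recent 3 months -- never on a random shuffle of the whole history.
cutoff = df["date"].max() - pd.DateOffset(months=3)
train, valid = df[df["date"] <= cutoff], df[df["date"] > cutoff]

features = ["price", "promo"]
model = GradientBoostingRegressor().fit(train[features], train["sales"])
print(mean_absolute_error(valid["sales"], model.predict(valid[features])))
```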

You need to handle the data well. Your preprocessing, your feature engineering, which will let you get the most out of your modelling. You also need to know the problem-specific elements and algorithms. You need to know what works well. But you can look back for information to inform that. You of course need access to the right tools – the updated and latest software for best accuracy. You have to think about the hours you put in and how you optimize them. When I was #1 I was working 60 hours on top of my day job!

Collaborate – data science is a team sport! It’s not just about splitting the work across specialisms, it’s about uncovering new insights by sharing different approaches. You gain experience over time, and that lets you focus your efforts where they will bring the best gain. And then use ensembling – combine the methods optimally for the best performance. And you can automate that…
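
At its simplest, ensembling is just a weighted average of each model’s predictions. A toy sketch – the predictions and weights here are invented, and in practice the weights would be tuned on a hold-out set:

```python
import numpy as np

def ensemble(predictions, weights):
    # Weighted average of several models' predictions (simplest ensemble).
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                                  # normalise to sum to 1
    return np.average(np.vstack(predictions), axis=0, weights=w)

# Toy scores from three hypothetical models on the same five cases,
# trusting the first model most.
gbm = np.array([0.9, 0.2, 0.6, 0.4, 0.8])
nn  = np.array([0.8, 0.3, 0.5, 0.5, 0.7])
lin = np.array([0.7, 0.4, 0.6, 0.3, 0.9])
print(ensemble([gbm, nn, lin], weights=[0.5, 0.3, 0.2]))
```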

And that brings us to H2O’s Driverless AI, which automates AI. It’s an AI that creates AI. It is built by a group of leading machine learning engineers, academics, data scientists, and Kaggle Grandmasters. It handles data cleaning and feature engineering. It uses cutting edge machine learning algorithms. And it optimises and combines them. And this is all through a hypothesis-testing driven approach. And that is so important: if I try a new feature or a new algorithm, I need to test it… And you can exhaustively find the best transformations and algorithms for your data. This allows solving of many machine learning tasks, and it all runs in parallel to make it very fast.

So, how does it work? Well you have some input data and you have a target variable. You set an objective or success metric. And then you need some allocated computing power (CPU or GPU). Then you press a button and H2O Driverless AI will explore the data, it will try things out, and it will provide predictions and model interpretability. You get a lot of insight, including the most predictive features. And the other thing is that you can do feature engineering: you can extract this pipeline, these feature transformations, and then use them with your own modelling.
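
I don’t know H2O’s internals, but conceptually the loop he describes is “generate candidates, score them against the chosen metric, keep the best”. A purely illustrative scikit-learn sketch of that idea – not the Driverless AI code, and with far fewer candidates than a real system would search:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the uploaded data

# Candidate pipelines: an automated system would generate many more,
# including engineered features, and evaluate them in parallel.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "rf": RandomForestClassifier(n_estimators=200),
    "gbm": GradientBoostingClassifier(),
}

# Score every candidate on the chosen success metric (here AUC) and keep
# the best -- the hypothesis-testing loop described above.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```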

Now, I have a minute long demo here…. where you upload data, and various features and algorithms are being tried, and you can see the most important features… Then you can export the scoring pipeline etc.

This work has been awarded Technology of the Year by InfoWorld, it has been featured in the Gartner report.

You can find out more on our website: https://www.h2o.ai/driverless-ai/ and there is lots of transparency about how this works, how the model performs etc. You can download a free trial for 3 weeks.

Q&A

Q1) Do you provide information on the machine learning models as well?

A1) Once we finish with the score, we build a second model, a simple one, to predict that score. The focus of that is to explain why we have shown this score. And you can see why you have this score with this model… That second interpretability model is slightly less automated. But I encourage others to look online for similar approaches – this is one surrogate model.
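
The surrogate idea is generic: train a simple, interpretable model to predict the complex model’s scores, then read the explanation off the simple model. A minimal sketch – my illustration, not H2O’s implementation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = load_breast_cancer(return_X_y=True)

# The "black box" whose scores we want to explain.
black_box = RandomForestClassifier(n_estimators=200).fit(X, y)
scores = black_box.predict_proba(X)[:, 1]

# Surrogate: a shallow tree trained to predict the black box's *score*,
# not the original label -- its splits approximate "why this score".
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, scores)
print(export_text(surrogate))
```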

Q2) Can I reproduce the results from H2O?

A2) Yes. You can download the scoring pipeline, and it will generate the code and environment to replicate this, see all the models, the data generated, and you can run that script locally yourself – it’s mainly Python.

Q3) That stuff is insane – probably very dangerous in the hands of someone just learning about machine learning! I’d be tempted to just throw data in… What’s the feedback that helps you learn?

A3) There is a lot of feedback and also a lot of warnings – so if test data doesn’t look enough like training data, for instance. But the software itself is not educational on its own – you’d need to see webinars and look at online materials, but then you should be in a good position to learn what it is doing and how.

Q4) You talked about feature selection and feature engineering. How robust is that?

A4) It is all based on hypothesis testing. But you can’t test everything without huge compute power. So we have a genetic algorithm that generates combinations of features, tests them, and then tries something else if that isn’t working.
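
For the curious, a toy version of that “generate, test, mutate” loop for feature subsets might look like this – mutation-only for brevity (real genetic algorithms also use crossover), and not H2O’s code:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    # Score a candidate feature subset; empty subsets score zero.
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

def mutate(mask, rate=0.1):
    flip = np.random.rand(n_features) < rate  # randomly toggle features
    return mask ^ flip

# Tiny genetic loop: keep the fitter half, refill with mutated copies.
population = [np.random.rand(n_features) < 0.5 for _ in range(10)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:5]
    population = survivors + [mutate(m) for m in survivors]

best = max(population, key=fitness)
print("selected features:", np.flatnonzero(best))
```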

Q5) Can you output as a model as eg a deserialised JSON object? Or use as an API?

A5) We have various outputs but not JSON. Best to look on the website as we have various ways to do these things.

 

Innovation Showcase

This next session showcases innovation in startups. 

Matt Jewell, R&D Engineer, Amiqus

I’m an R&D Engineer at Amiqus, and also a PhD student in Law at Edinburgh University. Firstly I want to talk about Amiqus, and our mission is to make civil justice accessible to the world. And we are engaged in GDPR as a data controller, but also as a trust and identity provider – where GDPR is an opportunity for us. We created amiqusID to enable people to more easily interact with the law – with data from companies house, driving licenses, etc.

As a PhD student in law there is some overlap between my job and my PhD research, and I was asked to talk about data ethics. So I wanted to note GDPR Article 22(3), which states that

“the data controller shall implement suitable measures to safeguard the data subject’s rights and freedoms and legitimate interests, at least the right to obtain human intervention on the part of the controller, to express his or her point of view and to contest the decision.”

And that’s across the board. GDPR recommits us to privacy, but also embeds privacy as a public good. And we have to think about what that means in our own best practices, because our own practices will shape what happens – especially as GDPR is still quite uncertain, still untested in law.

Carlos Labra, CEO & Co-Founder, Particle Analytics

I come from a mechanical engineering background, so this work is about simulation. And specifically we look at fluids simulation in aircraft. Particle simulation is actually the next step in industry, and that’s because it has been incredibly difficult to do this simulation with computers. We can do basic computer models for large scale materials, but those are not appropriate for particles. So at Particle Analytics we are trying to address this challenge.

So a single simulation for a silo – my model for a silo – has to calculate the interactions between every single particle (in the order of millions), in very small time intervals. That takes huge computing power. So for instance one of our clients, Astec, works on asphalt dryer/mixer technology and we are using Particle Analytics to enable them to establish and achieve new energy-based KPIs (Key Performance Indicators) that could make enormous savings per machine per year, purely by optimising to different analytics.
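
To see why this takes huge computing power, here is a deliberately naive sketch of one discrete-element time step: every pair of particles is checked for contact, so cost grows quadratically with particle count. Real DEM codes use neighbour lists and far richer contact physics; this is just the shape of the problem:

```python
import numpy as np

def dem_step(pos, vel, radius, dt=1e-5, k=1e4):
    """One naive discrete-element step: spring-like repulsion on overlap.

    pos, vel: (n, 3) arrays; radius: particle radius; k: contact stiffness.
    The O(n^2) pair loop is what makes million-particle silos so costly.
    """
    n = len(pos)
    force = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            d = pos[j] - pos[i]
            dist = np.linalg.norm(d)
            overlap = 2 * radius - dist
            if overlap > 0:                 # particles in contact
                f = k * overlap * d / dist  # push the pair apart
                force[i] -= f
                force[j] += f
    vel += force * dt                        # unit mass, explicit integration
    pos += vel * dt
    return pos, vel

# Two overlapping 5 mm particles repel each other.
pos = np.array([[0.0, 0.0, 0.0], [0.009, 0.0, 0.0]])
vel = np.zeros_like(pos)
pos, vel = dem_step(pos, vel, radius=0.005)
```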

So we look at spatial/temporal filters, multiscale analysis, and reducing data size and noise. These data operators generate new insights and KPIs. So the cost of simulation goes down, and the insights increase.

Steven Revill, CEO & Co-Founder, Urbantide

I’m here to talk to you about our platform USmart which is making smart data. How do we do this? Well, when we started a few years ago we recognised that our businesses, organisations, and places, would be helped by artificial intelligence based on data. That requires increased collaboration around data and increasing reuse of data. Too often data is in silos, and we need to break it out and share it. But we also need to be looking at real time data from IoT devices.

So, our solution is USmart. It collects data from any source in real time, and we create value with automatic data pipelines with analytics, visualisation and AI ready. And that enables collaboration – either with partners in a closed way, or as open data.

So, I want to talk about some case studies. Firstly Smartline, which is taking housing data to identify people at risk of, or in, fuel poverty. We have 80m data points so far, and we expect to reach up to 700m+ soon. This data set is open and when it goes live we think it will be the biggest open data set in the UK.

Cycling Scotland is showing the true state of cycling, helping them to make their case for funding and gain insight.

And we are working with North Lanarkshire Council on business rates, which could lead to savings of £18k per annum, but can also identify incorrect rates of £100k+ in value.

If you want to find out more do come and talk to me, take a look at USmart, and join the USmart community.

Martina Pugliese, Data Science Lead, Mallzee

I am data science lead for Mallzee – proudly established and run from Edinburgh. Mallzee is an app for clothes, allowing you to like or dislike a product. We show you 150+ brands. We’ve had 1.4m downloads, 500m ratings on products, 3m products rated. The app allows you to explore products, but it also acts as a data collection method for us and for our B2B offering to retailers. So we allow you to product test, very swiftly, your products before they hit the market.

Why do this? Well there are challenges that are two sides of the same coin: overstock, where you have to discount and waste money; and understock, where you have too little of the best stock and that means you don’t have time to make the best return on your products.

As well as gathering data, we also monitor the market for trends in pricing, discounting, something new happening… So for instance only 50.8% of new products last quarter were sold at full price. We work to help design, buying and merchandising teams improve this rate by 6-10% through customer feedback.

So, data is our backbone. For the consumer we enable discovery, we personalise the tool to you – it should save you time and money. At the same time the data also enables performance prediction. We have granular user segmentation. And it goes back to you – the best products go on the market. And long term that should have a positive environmental impact in reducing waste.

Maggie Philbin: Thank you. I’m going to ask you to feedback on each others ideas and work.

Carlos: I’m new to the data science world, so for me I need to learn more – and these presentations are so useful for that.

Martina: This is really useful for me, and great to see that lots of different things going on.

Matt: My work focuses on smart cities, so naturally interested in Steven’s presentation. Less keen on problematising the city.

Steven: Really interesting to discuss things backstage, but also exciting to hear Martina talking about how central data is for your business right now.

Maggie: And that is part of the wonderful things about being at Data Fest, that opportunity to learn from and hear from each other, to network and share.

We are back from lunch with a video on work in the Highlands and Islands using ambient technologies to predict likelihood of falls etc. 

Transforming Sectors With Data-Enabled Innovation – Orsola De Marco, Head of Startups, Open Data Institute

I’m going to talk about transforming sectors with data. The ODI, founded by Tim Berners-Lee and Nigel Shadbolt, focuses on data and what data enables. We think about data as infrastructure. If you think of data as roads you see that the number of roads does not matter as much as how they are connected… In the context of data we need data that can be combined, that is structured for connection and combination. And we look at data through open data and open innovation. What the ODI’s work has in common is that open innovation is at the core. This is not just about innovating, but also about making your organisation more porous, bringing in the outside. And I love the phrase “if you are the smartest person in the room, then you are in the wrong room”, because so often innovation comes from collaboration and from the outside.

Open innovation has huge potential value. McKinsey in 2013 predicted a $3-5 trillion impact of open data; Lateral Economics (2014) puts that at more like $20tn.

When we talk about open innovation and collaboration, we can talk about the corporate-startup marriage. We used to see linear solutions having good returns, but that is no longer the case. Problems are now much more complex, and startups are great at innovation, at thinking laterally, at finding new approaches. But corporates have scale, they have reach, and they have knowledge of their industries and markets. If you bring these two together, it’s clear you can bring a good opportunity to life.

An example I wanted to share here is Transport for London, who released open data to enable startups and SMEs to use it. CityMapper is one of the best known of the tools built on that data. Last year, after several years of open data, they commissioned a Deloitte report (2017) which found that this release had generated huge savings for TfL.

Another example is Arup. Historically their innovation had taken place in house. They embraced a more open approach, and worked with two of our start ups, Mastodon C and Smart Sensors. Mastodon C helped Arup explore airport data so that Arup didn’t need to do that processing themselves. Smart Sensors installed 200 IoT sensors, sharing approaches to those sensors, what it means to implement IoT in buildings, how they could use this technology. And Arup rolled them out to some of their services.

Those are some examples. We’ve worked with 120 startups across the world. And they have generated over £37.2M in sales and investment. These are real businesses bringing real value – not just a guy in a shed. The major challenge is on the supply side of the data. A lot of companies are reluctant to share, mentioning three blockers: (1) it feels very risky to open data up – an issue that feels highly relevant this week; (2) it’s expensive to do, especially if you don’t know the value coming back; (3) a perceived lack of data literacy and skills. Those are all important… But if you lead and innovate, you get to set the tone for innovation in your sector.

The idea of disruption is raised a lot, but it is real. And to actually disrupt, a culture of open innovation is essential. It needs to be brought in at senior level and brought into the sector.

Data infrastructure can transform sectors. And joining forces between data suppliers and users is important there. For instance we are working on a project called Open Active, with Sport England. A lack of information on what was going on in different areas was an issue for people getting active. We were involved at the outset and could see that data was the blocker here… If you tried to aggregate information it was impossible. So, in the first year of the programme we brought providers into the room, agreed an open standard, and that enabled aggregation of data. We are now in the second phase and, now that the data is consistent and available, we are bringing start ups in to engage and do things with that data. And those start ups aren’t all in sports – some are in the healthcare sector, using sports data to augment information shared by medics, and some are leisure companies helping individuals to find things to do with their spare time.

Another example is the Open Banking sector. Over 60% of UK banking customers haven’t changed their bank account in 5 years. And many of those haven’t changed them in 20 years. So this initiative enables customers to grant secure access to their banking details for e.g. mortgage lenders, or to enable marketplaces offering things like energy switching. Our experience in this programme was in facilitating these banks, and we took that experience of data portability forward… And now we are working with Mexico on a FinTech law that requires all banks to have an open API.

In order to innovate in sectors it’s important to widen access to data. This doesn’t mean not taking data privacy seriously, or losing competitive advantage.

And I wanted to highlight a very local programme. Last year we began a project in the peer to peer accommodation market. The Scottish expert advisory panel noted that whilst a lot of data is generated, no real work is looking at the impact of the sharing economy in accommodation. That understanding will enable policy decisions tied to real concerns. We will be making recommendations on this very soon. If you are interested, do get in touch and be part of this.

Q&A

Q1) You talked a lot about the value of data. How do you measure that economic value like that?

A1) We base value on sales and investment generated, and/or time or money saved in processes. It’s not an exact science but it looks for changes to the status quo.

Q2) What is the most important and valuable thing from your experience here?

A2) I think I’ll approach that answer in two ways. We do innovation work with data but we often facilitate conversations between data providers and start ups. For making data available we remove those blockers; for start ups it’s facilitating those conversations, helping them grow and develop, and tailoring that support.

Q3) What next?

A3) Our model is a sector transformation model. We talk to a sector about sharing and opening up, and then we have start ups in an accelerator so that data will find a use. That’s a huge difference from just publishing the data and wondering what will happen to it.

Designing Things with Spending Power – Chris Speed, Chair of Design Informatics, University of Edinburgh

I have a fantastic team of designers and developers, and brilliant students who ask questions, including what things will be like in Tomorrow’s World!  We look at all kinds of factors here around data. So I want to credit that team.

Many of you in the room will be aware that data is about value constellations, rather than value chains. These are complex markets with many players – which may be humans but may also be bots. That changes our capacity to construct value, since we have agents that construct value. And so I will talk about four objects to look at the disruption that can be made, and what that might mean, especially as things gain agency and power. One of the things we asked was: what happens when we give things spending power?

See the diagram from the RAND organisation comparing centralised with decentralised and distributed networks – we see this model again and again… But things drift back occasionally (there’s only one internet banking platform now, right?). I’m going to show this 2014 bitcoin blockchain transaction video – the transactions move too fast to screengrab these days! So… what happens when we have distributed machines with spending power? And when transactions go down to absolutely tiny amounts of money?

So, we run BlockExchange workshops, with lego, to work on the idea of blockchain, what it means to be a distributed transaction system.

Next we have the fun stuff… What happens when we have things like Ethereum… and smart contracts? What could you do with digital wallets? If the UN gives someone a digital passport, do they need sovereignty? So, we undertake bodily experiments with this stuff. We ran a physical experiment – body storming – with bitcoin wallets and smart contracts… A bit like Pokemon Go but with cash – if you hit a hotspot the smart contract assigns you money, or when you enter a sink, you lose bitcoin. So, here is video of our GeoCoin app and also an experiment running in Tel Aviv.
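
I didn’t catch implementation details, but the GeoCoin mechanic described is essentially a geofence rule attached to a wallet. A hypothetical sketch of that logic, with invented coordinates and amounts (the real version runs as a smart contract on-chain):

```python
from math import asin, cos, radians, sin, sqrt

def distance_m(lat1, lon1, lat2, lon2):
    # Haversine distance in metres between two lat/lon points.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

# Hypothetical geofences: entering a "hotspot" credits the wallet,
# entering a "sink" debits it -- the mechanic described above.
zones = [
    {"lat": 55.9533, "lon": -3.1883, "r": 50, "delta": +0.001},  # hotspot
    {"lat": 55.9520, "lon": -3.1900, "r": 50, "delta": -0.001},  # sink
]

def update_wallet(balance, lat, lon):
    for z in zones:
        if distance_m(lat, lon, z["lat"], z["lon"]) <= z["r"]:
            balance += z["delta"]  # on-chain, the smart contract would do this
    return balance

print(update_wallet(0.01, 55.9533, -3.1883))
```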

These three banking volunteers set out to design a new type of cinema experience… They enter the cinema by watching two trailers that can be picked up in the street… Another colleague decides not to do this… They gain credit by tweeting about trailers… Bodystorming allows new ideas to be developed (confusingly, there is no cinema… This is, er, a cinema of the mind – right Chris?). 

Next we have a machine with a bitcoin wallet. Programmable money allows us to give machines buying power… Blockchain attaches a history to things, adding values to value… So, we set up a coffee machine, Bitbarista, with an interface that asks the coffee drinker to make decisions about what kind of coffee they want, what values matter… mediating the space between values and value.

We have hairdryers – these are new and have just gone to the Policy Unit this week. We have the Gigbliss Plus hairdryer… that allows you to buy and trade energy and to dry your hair when energy is cheaper… What happens when you do involve the public in balancing energy? And we have another hairdryer… that asks whether you want unethical energy now, or whether you want to wait for an ethical source – the hairdryer switches on accordingly. And then we have Gigbliss Auto, which has no buttons. You don’t have control, only the bitcoin wallet has decision powers… You don’t know when it comes on… but it will. It changes control. Of those three hairdryers, which are we happy to move to… where do we feel happy here?
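
The three hairdryers map to three simple control policies. A sketch of the middle one – stay off until the energy is both cheap enough and from an acceptable source – with invented feed data, since the real devices presumably read a live tariff or grid-mix feed:

```python
# Invented half-hourly readings: (price in p/kWh, generation source).
feed = [(18, "gas"), (16, "gas"), (9, "wind"), (7, "wind")]

def wait_for_acceptable_energy(feed, max_price=10, ok_sources=("wind", "solar")):
    # Gigbliss-style policy: only switch on when energy is cheap AND clean.
    for i, (price, source) in enumerate(feed):
        if price <= max_price and source in ok_sources:
            return f"switch on at reading {i}: {price}p/kWh from {source}"
    return "conditions never met: stayed off"

print(wait_for_acceptable_energy(feed))
```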

And then we have KASH cups, with chips in them. You can only buy coffee when you put two cups down. So you get credit, through the cup’s digital wallet, to encourage networking. You don’t have to get coffee – you can build up credit. We had free coffee in the other room… But we had a very fancy barista for the KASH cups, and people queued for 20 minutes – coffee with social value.

Questions for us… We give machines agency, and credit… What does that mean for value, and how do we balance value?

Maggie: It’s at this point I wish Tomorrow’s World still existed!

Q&A

Q1) where is this fascinating work taking you?

A1) I think this week has been so disruptive in terms of data and technologies disruption of social, civic, political values. I think understanding that we can’t balance value, or fair trade, etc. on our own is helpful and I’m really excited by what bots can offer here…

Q2) I was fascinated by the hairdryers… I’ve been in the National Grid’s secret control room, and that thing where EastEnders finishes and everyone makes a cup of tea means bringing a whole power station on board… But waiting 10 minutes might avoid that need. It’s not trivial, it’s huge.

A2) Yes, and I think understanding how that waiting, or understanding consequences of actions would have a real impact. The British public are pretty conscious and ethical I think, when they have that understanding…

Q3) Have you thought about avoiding queues with blockchain?

A3) We don’t want to just play incentives to get people out of queues. People are there for different reasons, different values, some people enjoy the sociability of a queue… Any chance to open it up, smash it up, and offer the opportunity to co-construct is great. But we need to do that with people not just algorithms.

Maggie: At this point I should be introducing Cathy O’Neil, but she has been snowed in by 15 inches of snow on the East Coast of the US. So, she will come over at a later date and you’ll all be invited. So, in place of that we have a panel on the elephant in the room, the Facebook and Cambridge Analytica scandal, with a panel on data and ethics.

Panel session: The Elephant in the Room: What Next? – Jonathan Forbes (JF), CTO, Merkle Aquila (chair); Brian Hills (BH), Head of Data, The Data Lab; Mark Logan (ML), Former COO Skyscanner, Investor and Advisor to startups and scale ups; Mhairi Aitken (MA), Research Fellow, University of Edinburgh. 

JF: So, thinking of that elephant in the room… that election issue… that data use. I want to know: what could Facebook have done better?

ML: It has taken them a long time to respond, which seems strange… But I see it as a positive really. They see this as a much bigger issue rather than just the transactional elements here. In that room you look at risk and you look at outrage. I think Facebook were trying to figure out why outrage was so high, I think that’s what has surprised them. I think they took time to think about what was happening to them. I don’t think it’s just about electing a game show host to president… The outrage is different. Cambridge Analytica is a bad actor, not just on data but in their advocacy for other problematic tactics. Facebook shouldn’t be bundled into that. Part of the issue here is that you have a monopoly. Facebook is an advertising company – they need to generate data and pass it onto app developers. Those two things don’t totally align. And I think the outrage is about trust and the expectations of users.

JF: You are closest to the public in your research. The share price is dropping significantly right now… How, based on past experience, do you see this playing out?

MA: I’m used to talking to people about public sector use of data. Often people talk about Facebook data and make two points: firstly that they contribute their own data, control that, and know how it’s used; but they also have very high expectations of public sector organisations and don’t have those for private sector organisations – they think someone will generate ads and profit. But when data is used in politics that’s very different, and that changes expectations.

JF: I enjoyed your comment about the social license… and I think this may be a sign that the license is being withdrawn. The GDPR legislation certainly changes some things there. I was interested to see Tim Berners Lee’s response, taking Mark Zuckerberg’s perspective… I was wondering, Brian, about the commercial pressures and the public pressures here. Are they balancing that well?

BH: No. When we look back I think this will be a pivotal moment. I kind of feel like the GDPR piece is like being in a medieval torture chamber… We have a countdown but the public don’t know much about it. With Facebook it’s like we have a firework in the sky and people are asking what on earth is going on… And we have an opportunity to have a discussion about the use of data. As we leave today we have a challenge around communicating our work with data – what are our responsibilities here? On the big data thing, many business cases suggest we’ve failed – we’ve focused on the technology and only that. And I feel we now have an opportunity and a window here.

JF: I’d like to take the temperature of the room… How many of you had Facebook on your phone, and don’t this week? None.

ML: I think that’s the point. The idea of not doing to others’ data what you wouldn’t want done to your own… But the reality is that legislation is playing catch up with practice. Commercially it’s hard to do the right thing. I think Mark Zuckerberg has reasonably good intentions here… But we have this monopoly… The parallel here is banking. And monopoly legislation hasn’t kept pace with the monopolies we have. I think it would be great if you could export your data, friends data, etc. to another platform. But we can’t.

Comment: I think you asked the wrong question… Who here doesn’t have Facebook on their phone at all? Actually quite a lot. I think actually we have that sense that power corrupts and absolute power corrupts absolutely. And I don’t feel I’m missing out, I’m sure others feel that too. And I’m unsurprised about Facebook, I could see where it was going.

JF: OK, so moving towards what we can do: should we have a code of conduct, a Hippocratic oath for data, a “do no harm”?

BH: I don’t see ethics featuring in data models. I think we have to build that in. Cathy O’Neil talks about Weapons of Math Destruction… We have to educate our data science students in how to use these tools ethically, to think about who they will work with. Cathy was a quant and didn’t like it, so she walked away. We have to educate our students about the choices they make. We talk about optimisation, optimisation of marketing. In optimising STEM stuff… we are missing things… I think we need to move towards STEAM, where A is for Arts. We have to be inclusive of arts and humanities working with these teams, to think about skills and diversity of skills.

JF: Particularly thinking about healthcare

MA: There is increasing drive to public engagement, to public response. That has to be much more at the heart of training for data scientists and how it relates to the society we want to create. There can be a sense of slowing momentum, but it’s fundamental to getting things right, and shaping directions of where we are going…

JF: Mark, you mentioned trust, and your organisation has been very focused on trust.

ML: These multifaceted networks are built on trust. For Skyscanner trust was so much more important than favouring particular clients. I think Facebook’s error has been to not be more transparent in what they do. We have had comments about machine learning as hype, but actually machine learning is about machines learning to do something without humans. We are moving to a place where decisions will be made by machines. We have to govern that, and to police machines with other machines. And we have to have algorithms to ensure that machine learning is appropriate and ethical.

JF: I agree. It was interesting to me that Weapons of Math Destruction is the top seller in algorithms and programming – a machine generated category – which is reassuring: those working in this space are reading about this. By show of hands, how many here working in data science are thinking about ethics? Some are. But it’s unclear who isn’t working with data, or who isn’t working ethically. So, to finish I want your one takeaway for this week.

BH: I think it’s up to us to decide how to do things differently, and to make the change here. If we are true data warriors driving societal benefit then we have to make that change ourselves.

ML: We do plenty to mess up the planet. I think machine learning can help us sort out the problems we’ve created for ourselves.

MA: I think its been a wonderful event, particularly the variety and creativity being shared. And I’m really pleased to open up these conversations and look at these issues.

JF: I’m optimistic too. But don’t underestimate the ability of a small group of committed people to change the world. So, Data Warriors, all of you… You know what to do!

Maggie: Thank you all for your conversation, your enthusiasm. One message I really want to give you is that when you look at the use of data, the capacity to do good… The vast majority of young people are oblivious. They could miss out on an amazing career. But as the world changes, they could miss out on a decent career without these skills. Don’t underestimate your ability as one person with knowledge of that area to make a difference, to influence and to inspire. A few years back, in Greenock, we ran an event with Teen Tech and the support of local tech companies made all the difference… One team went to the finals in London, won and went to Silicon Valley… And that had enormous impact on that school and community, and now all S2 students do that programme, local companies come in for a Dragon’s Den type set up. Any moment that you can inspire and support those kids will make all the difference in those lives, and can make all the difference, especially if family, parents, community don’t know about data and tech.

Closing Comments – Gillian Docherty, CEO, The Data Lab

Firstly thank you to Maggie for being an amazing host!

I have a few thank yous to make. It has been an outstanding week. Thank you all for participating in this event. This has been just one event of fifty. We’ve had another 3000 data warriors, on top of you 450 data warriors for Data Summit. Thank you to our amazing speakers, and exhibitors. The buzz has been going throughout the event. Thank you to our sponsors, and to Scottish Government and Scottish Enterprise. Thank you to our amazing volunteers, to Grayling who has been working with the press. To our venue, events team and caterers. Our designer from two fifths design. And the team at FutureX who helped us organise Data Talent and Data Summit – absolutely outstanding job! Well done!

And two final thank yous. Firstly the amazing Data Lab team. We have thousands of new people being trained, huge numbers of projects. I also want to specifically mention Craig Skelton who coordinated our Fringe events; Cecilia who runs our marketing team; and Fraser and John who were behind this week!

My final thank you is to all of you, including the teams across Scotland participating. It is a fantastic time to be working in Scotland! Now take that enthusiasm home with you!

Oct 13 2015
 
Michael Dewar, Data Scientist at The New York Times, presenting at the Data Science for Media Summit held by the Alan Turing Institute and University of Edinburgh, 14th October 2015.

Today I am at the “Data Science for Media Summit” hosted by The Alan Turing Institute & University of Edinburgh and taking place at the Informatics Forum in Edinburgh. This promises to be an event exploring data science opportunities within the media sector and the attendees are already proving to be a diverse mix of media, researchers, and others interested in media collaborations. I’ll be liveblogging all day – the usual caveats apply – but you can also follow the tweets on #TuringSummit.

Introduction – Steve Renals, Informatics

I’m very happy to welcome you all to this data science for media summit, and I just wanted to explain that idea of a “summit”. This is one of a series of events from the Alan Turing Institute, taking place across the UK, to spark new ideas, new collaborations, and build connections – so researchers understand areas of interest for the media industry, and the media industry understands what’s possible in research. This is a big week for data science in Edinburgh, as we also have our doctoral training centre, so you’ll also see displays in the forum from our doctoral students.

So, I’d now like to hand over to Howard Covington, Chair, Alan Turing Institute.

Introduction to the Alan Turing Institute (ATI) – Howard Covington, Chair, ATI

To introduce ATI I’m just going to cut to our mission: to make the UK the world leader in data science and data systems.

ATI came about from a government announcement in March 2014, then a bidding process leading to the universities being chosen in Jan 2015, and a joint venture agreement between the partners (Cambridge, Edinburgh, Oxford, UCL, Warwick) in March 2015. Andrew Blake, the institute’s director, takes up his post this week. He was previously head of research for Microsoft R&D in the UK.

Those partners already have about 600 data scientists working for them and we expect ATI to be an organisation of around 700 data scientists as students etc. come in. And the idea of the data summits – there are about 10 around the UK – is for you to tell us your concerns, your interests. We are also hosting academic research sessions for researchers to propose their ideas.

Now, I’ve worked in a few start ups in my time and this is going at pretty much as fast a pace as you can go.

We will be building our own building, behind the British Library opposite the Francis Crick Institute. There will be space at that HQ for 150 people. There is £67m of committed funding for the first 5 years – companies and organisations with a deep interest who are committing time and resources to the institute.

The Institute sits in a wider ecosystem that includes: Lloyds Register – our first partner who sees huge amounts of data coming from sensors on large structures; GCHQ – working with them on the open stuff they do, and using their knowledge in keeping data safe and secure; EPSRC – a shareholder and partner in the work. We also expect other partners coming in from various areas, including the media.

So, how will we go forward with the Institute? Well, we want to do both theory and impact. So we want major theoretical advances, but we will devote time equally to practical, impactful work. Maths and Computer Science are both core, but we want to be a broad organisation across the full range of data science, reflecting that we are a national centre – though we will have to take a specific interest in particular areas. There will be an ecosystem of partners. And we will have a huge training programme, with around 40 PhD students per year, and we want those people to go out into the world to take data science forward.

Now, the main task of our new director, is working out our science and innovation strategy. He’s starting by understanding where our talents and expertise already sit across our partners. We are also looking at the needs of our strategic partners, and then the needs emerging from the data summits, and the academic workshops. We should then soon have our strategy in place. But this will be additive over time.

When you ask someone what data science is, the definition is ever changing and variable. So I have a slide here that breaks the rules of slide presentations really, in that it’s very busy… But data science is very busy. So we will be looking at work in this space, and going into more depth, for instance on financial sector credit scoring; predictive models in precision agriculture; etc. Underlying all of these are similarities that cross many fields. Security and privacy is one such area – we can only go as far as it is appropriate to go with people’s data, an issue both for ATI and for individuals.

I don’t know if you think that’s exciting, but I think it’s remarkably exciting!

We have about 10 employees now, we’ll have about 150 this time next year, and I hope we’ll have opportunity to work with all of you on what is just about the most exciting project going on in the UK at the moment.

And now to our first speaker…

New York Times Labs – Keynote from Mike Dewar, Data Scientist

I’m going to talk a bit about values, and about the importance of understanding the context of what it is we do. And how we embed what we think is important into the code that we write, the systems that we design and the work that we do.

Now, the last time I was in Edinburgh, in 2009, I was doing a Post Doc working on modelling biological data, based on video of flies. There was loads of data, a mix of disciplines, and we were market focused – the project became a data analytics company. And, like much other data science, it was really rather invasive – I knew huge amounts about the sex life of fruit flies, far more than one should need to! We were predicting behaviours, understanding correlations between environment and behaviour.

I now work at the New York Times R&D and our task is to look 3-5 years ahead of current NYT practice. We have several technologists there, but also colleagues who are really designers. That has stretched me a bit… I am a classically trained engineer – trained to go out into the world, find the problem, and then solve it by finding some solution, some algorithm to minimise the cost function. But it turns out in media, where we see decreasing ad revenue and increasing subscription, that we need to do more than minimise the cost function… That basically leads to clickbait. So I’m going to talk about three values that I think we should be thinking about, and projects within each area. So, I shall start with Trust…

Trust

It can be easy to forget that much of what we do in journalism is essentially surveillance, so it is crucial that we do our work in a trustworthy way.

So the first thing I want to talk about is a tool called Curriculum, a Chrome browser plug-in that observes everything I read online at work. It then takes chunks of text, aggregates them with what others are reading, and projects that onto a screen in the office. So, firstly, the negative… I am very aware I’m being observed – it’s very invasive – and that layer of privacy is gone, which shapes what I do (and it ruins Christmas!). But it also shares what everyone is doing, a sense of what collectively we are working on… It is built in such a way as to make it inherently trustworthy, in four ways: it’s open source, so I can see the code that controls the project; it is fantastically clearly written and clearly architected, so reading the code is actually easy – it’s well commented, and I’m able to read it; it respects existing boundaries on the web – it does not read https (so my email is fine) and respects incognito mode; and I know how to turn it off – also very important.

In contrast to that I want to talk about Editor. This is a text editor like any other… Except whatever you type is sent to a series of microservices which look for similarity against the NYT keyword corpus, and then send that back to the editor – enabling a tight markup of the text. The issue is that the writer is used to writing alone, then sending to production. Here we are asking the writer to share their work in progress and send it to central AI services at the NYT, so making that trustworthy is a huge challenge, and we need to work out how best to do this.
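[To make that round trip concrete, here is a minimal sketch – the endpoint, payload shape and response format are all hypothetical illustrations of mine, not the NYT’s actual API:]

```python
# Hypothetical sketch of an editor sending draft text to an annotation
# microservice and getting entity markup back. The URL and JSON shape
# are invented for illustration.
import requests

ANNOTATE_URL = "https://example.com/annotate"  # hypothetical service

def annotate(paragraph: str) -> list:
    """Send work-in-progress text to the service; return matched entities."""
    resp = requests.post(ANNOTATE_URL, json={"text": paragraph}, timeout=5)
    resp.raise_for_status()
    return resp.json().get("entities", [])

# The trust problem is visible right here: every draft paragraph leaves
# the writer's machine before publication.
print(annotate("The mayor spoke at City Hall on Tuesday."))
```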

Legibility

Data scientists have a tendency towards the complex. I’m no different – show me a new tool and I’ll want to play with it; I enjoy a new toy. And we love complex algorithms, especially if we spent years learning about them in grad school. But those can render any data illegible.

So we have [NAME?], an infinite scrolling browser – when you scroll you can continue on. And at the end of each article an algorithm offers 3 different recommendation strands… It’s like a choose your own adventure experience. So we have three recommended articles, based on very simple recommendation engines, which renders them legible. These are “style graph” – things that are similar in style; “collaborative filter” – readers like you also read; “topic graph” – similar in topic. These are all based on the nodes and edges of the connections between articles. They are simple, legible concepts, and easy to run, so we can use them across the whole NYT corpus. And they are understandable, so they have a much better chance of resonating with our colleagues.
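[As an illustration of how simple and legible such a strand can be, here is a minimal toy version of a “topic graph” recommender – my own sketch, not the NYT’s code:]

```python
# Toy "topic graph" recommender: articles are nodes, shared topic tags
# are edges, and we recommend the neighbours sharing the most tags.
articles = {  # hypothetical corpus: article id -> topic tags
    "a1": {"politics", "europe"},
    "a2": {"politics", "elections"},
    "a3": {"europe", "travel"},
    "a4": {"sport"},
}

def topic_graph_recommend(current_id, k=3):
    """Rank other articles by the number of shared topic tags."""
    current_tags = articles[current_id]
    scored = [(len(current_tags & tags), other)
              for other, tags in articles.items()
              if other != current_id and current_tags & tags]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # most shared tags first
    return [other for _score, other in scored[:k]]

print(topic_graph_recommend("a1"))  # -> ['a2', 'a3']; 'a4' shares nothing
```

The legibility is the point: anyone in the building can see why an article was recommended.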

As a counterpoint, we were tasked with looking at behavioural segmentation – to see how we can build different products for different groups. Typically segmentation is done with demography. We were interested, instead, in using just the data we had, the behavioural data. We arranged all of our pageviews into sessions (arriving at a page through to leaving the site). So, for each session we represented the data as a transition matrix, to understand the probability of moving from one page to the next… So we can perform clustering of behaviours… And looking at this we can see that there are some clusters that we already know about… We have the “one and dones” – read one article then move on. We found the “homepage watchers” – who sit on the homepage and use that as a launching point. The rest, though, the NYT didn’t have names for… So we now have the “homepage bouncer” – going back and forth from the front page; and the “section page starter” as well, for instance.

This is a simple k-means clustering model – very simple, but dynamic and effective. However, this is very, very radical at the NYT amongst non-data scientists. It’s hard to make it resonate enough to drive any behaviour or design in the building. We have a lot of work to do to make this legible and meaningful for our colleagues.
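[For the curious, the session-to-transition-matrix-to-k-means pipeline can be sketched in a few lines – again a toy illustration under my own assumptions about page types, not the NYT’s implementation:]

```python
# Toy behavioural segmentation: each session becomes a row-normalised
# transition matrix over page types; the flattened matrices are then
# clustered with k-means.
import numpy as np
from sklearn.cluster import KMeans

PAGES = ["homepage", "article", "section"]  # hypothetical page types
IDX = {p: i for i, p in enumerate(PAGES)}

def session_to_matrix(session):
    """Estimate P(next page type | current page type) for one session."""
    m = np.zeros((len(PAGES), len(PAGES)))
    for a, b in zip(session, session[1:]):
        m[IDX[a], IDX[b]] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)

sessions = [
    ["homepage", "article"],                         # "one and done"-ish
    ["homepage", "article", "homepage", "article"],  # "homepage bouncer"-ish
    ["section", "article", "article"],               # "section page starter"-ish
    ["homepage", "homepage", "article"],
]

X = np.array([session_to_matrix(s).ravel() for s in sessions])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # a cluster id per session; naming the clusters is the human step
```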

The final section I want to talk about is Live…

Live

In news we have to be live, we have to work in the timescales of seconds to a minute. In the lab that has been expressed as streams of data – never ending sequences of data arriving at our machines as quickly as possible.

So, one of our projects, Delta, produces a live visualisation of every single page view of the NYT – a pixel per person, starting on the globe, then pushing outwards. If you’ve visited the NYT in the last year or so, you’ve generated a pixel on the globe in the lab. We use this to visualise the work of the lab. We think the fact that this is live is very visceral. We always start with the globe… But then we show a second view, using the same pixels in the context of sections, of the structure of the NYT content itself. And that can be explored with an Xbox controller. Being live makes it relevant and timely, to understand current interests and content. It ties people to the audience, and encourages other parts of the NYT to build some of these live experiences… But one of the tricky things is that it is hard to use live streams of data, hence…

Streamtools, a tool for managing live streams of data. It should be reminiscent of Simulink or LabVIEW etc. [when chatting to Mike earlier I suggested it was a super-pimped, realtime Yahoo Pipes and he seemed to agree with that description too]. It’s now on its third incarnation and you can come and explore a demo throughout today.
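[The block-based idea translates naturally to code. Here is a minimal sketch in Python – my own analogy for how composable blocks over a never-ending stream might look, not Streamtools itself:]

```python
# Toy "blocks" over an endless stream of pageview events: a source block,
# a filter block, and a windowed count block, composed like a pipeline.
import random
import time
from collections import Counter

def pageviews():
    """Source block: simulate a never-ending stream of pageview events."""
    sections = ["world", "sport", "arts"]
    while True:
        yield {"section": random.choice(sections), "ts": time.time()}

def filter_block(stream, section):
    """Filter block: pass through only events for one section."""
    return (e for e in stream if e["section"] == section)

def count_block(stream, window=5):
    """Aggregate block: emit counts over fixed-size windows of events."""
    counts = Counter()
    for i, event in enumerate(stream, start=1):
        counts[event["section"]] += 1
        if i % window == 0:
            yield dict(counts)
            counts.clear()

for snapshot in count_block(filter_block(pageviews(), "world")):
    print(snapshot)  # would run forever, like a live dashboard
    break            # remove to keep streaming
```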

Now, I’ve been a data scientist for a while, and when we bring our systems to the table we need to be aware that what we build embodies our own values. And I think that for data science in media we should be building trustworthy systems, tools which are legible to others, and those that are live.

Find out more at nytlabs.com.

Q&A

Q1) I wanted to ask about expectations. In a new field it can be hard to manage expectations. What are your users expectations for your group and how do you manage that?

A1) In R&D we have one data scientist and a bunch of designers. We make speculative futures, build prototypes, and bring them to the NYT, to the present, to help them make decisions about the future. In terms of data science in general at the NYT… Sometimes things look magic and look lovely but we don’t understand how they work; in other places it’s much simpler, e.g. counting algorithms. But there’s no risk of a data science winter – we’re being encouraged to do more.

Q2) NYT is a paper of record, how do you manage risk?

A2) Our work is informed by a very well worded privacy statement that we respect and build our work on. But the other areas of ethics etc. are still to be looked at.

Q3) Much of what you are doing is very interactive, and much of data science is about processing large sets of data… So can you give any tips for someone working with terabytes of data on working with designers?

A3) I think a data scientist is essentially creating a palette of colours for your designer to work with. And forcing yourself to explain that to the designer is useful, and enables those colours to be used. And we encourage that there isn’t just one solution, we need to try many. That can be painful as a data scientist as some of your algorithms won’t get used, but it gives some great space to experiment and find new solutions.

Data Journalism Panel Session moderated by Frank O’Donnell, Managing Editor of The Scotsman, Edinburgh Evening News and Scotland on Sunday

We’re going to start with some ideas of what data journalism is…

Crina Boros, Data Journalist, Greenpeace

I am a precision journalist, and I have just joined Greenpeace having worked at Thomson Reuters, BBC Newsnight etc. I am not a data scientist, or a journalist – I am a precision journalist working with data. At Greenpeace data is being used for investigative journalism purposes, areas no longer or rarely picked up by mainstream media, to find conflicts of interest, and to establish facts and figures for use in journalism and in campaigning. And it is a way to protect human sources and enable journalists in their work. I have, in my role, both used data that exists and created data where it does not exist. And I’ve sometimes worked with data that was never supposed to see the light of day.

Evan Hensleigh, Visual Data Journalist, The Economist

I was originally a designer and therefore came into information visualisation and data journalism by a fairly convoluted route. The Economist has been running since the 1840s and we like to say that we’ve been doing data science since we started. We were founded at the time of the Corn Laws in opposition to those proposals, and visualised the impact of those laws as part of that.

The way we now tend to use data is to illustrate a story we are already working on. For instance working on articles on migration in Europe, and looking at fortifications and border walls that have been built over the last 20 to 30 years lets you see the trends over time – really bringing to life the bigger story. It’s one thing to report current changes, but to see that in context is powerful.

Another way that we use data is to investigate changes – a colleague was looking at changes in ridership on the Tube, and the rise of the rush hour – and then use that to trigger new articles.

Rachel Schutt, Chief Data Scientist, Newscorp

I am not a journalist but I am the Chief Data Scientist at Newscorp, and I’m based in New York. My background is a PhD in statistics, and I used to work at Google in R&D and algorithms. And I became fascinated by data science so started teaching an introductory course at Columbia, and wrote a book on this topic. And what I now do at Newscorp is to use data as a strategic asset. So that’s about using data to generate value – around subscriptions, advertising etc. But we also have data journalism so I increasingly create opportunities for data scientists, engineers, journalists, and in many cases a designer so that they can build stories with data at the core.

We have both data scientists and data engineers – hybrid skills around engineering, statistical analysis, etc. Sometimes an individual’s skills cross those borders, sometimes it’s different people. And we also have those working more in design and data visualisation. So, for instance, we now get data dumps – the Clinton emails, transcripts from Ferguson etc. – and we know those are coming, so we can build tools to explore them.

A quote I like is that data scientists should think like journalists (from DJ Patil) – in any industry. In Newscorp we also get to learn from journalists, which is very exciting. But the idea is that you have to be investigative, be able to tell a story…

Emily Bell says “all algorithms are editorial” – because value judgements are embedded in those algorithms, and you need to understand the initial decisions that go with that.

Jacqui Maher, Interactive Journalist, BBC News Labs
I was previously at the NYT, mainly at the Interactive News desk in the newsroom. An area crossing news, visualisation, data etc. – so much of what has already been said. And I would absolutely agree with Rachel about the big data dumps and looking for the story – the last dump of emails I had to work with was from Sarah Palin, for instance.

At the BBC my work lately has been on a concept called “Structured Journalism” – when we report on a story we put together all these different entities in a very unstructured set of data: audio, video etc. Many data scientists will try to extract that structure back out of that corpus… So we are looking at how we might retain the structure that is in a journalist’s head as they are writing the story. So digital tools that will help journalists during the investigative process. And ways to retain connections, structures etc. And then what can we do with that… What can make it more relevant to readers/viewers – context pieces, ways of adding context in a video (a tough challenge).

If you look at work going on elsewhere, for instance at the Washington Post on IS coverage, they are looking at how to similarly add context, how they can leverage previous reporting without having to start from scratch.

Q&A/Discussion

Q1 – FOD) At a time when we have to cut staff in media, in newspapers in particular, how do we justify investing in data science, or how do we use data science?

A1 – EH) Many of the people I know came out of design backgrounds. You can get pretty far just using available tools. There are a lot of useful tools out there that can help your work.

A1 – CB) I think this stuff is just journalism, and these are just another set of tools. But there is a misunderstanding: you don’t press a button and get a story. You have to understand that it takes time – there’s a reason that it is called precision journalism. And sometimes the issue is that the data is just not available.

A1 – RS) Part of the challenge is about traditional academic training and what is and isn’t included there… But there are more academic programmes on data journalism now. It’s a skillset issue. I’m not sure, on a pay basis, whether data journalists should get paid more than other journalists…

A1 – FOD) I have to say in many newsrooms journalists are not that numerate. Give them statistics, even percentages and that can be a challenge. It’s almost a badge of honour as wordsmiths…

A1 – JM) I think most newsrooms have an issue of silos. You also touched on the whole “math is hard” thing. But to do data journalism you don’t need to be a data scientist. They don’t have to be an expert on maths, stats, visualisation etc. At my former employer I worked with Mike – who you’ve already heard from – who could enable me to cross that barrier. I didn’t need to understand the algorithms, but I had that support. You do see more journalist/designer/data scientists working together. I think eventually we’ll see all of those people as journalists though as you are just trying to tell the story using the available tools.

Q2) I wanted to ask about the ethics of data journalism. Do you think there is a developing field of ethics in data journalism?

A2 – JM) I think that’s a really good question in journalism… But I don’t think it’s specific to data journalism. When I was working at the NYT we were working on the Wikileaks data dumps, and there were huge ethical issues there, around the information that was included in terms of names, in terms of risk. And in the end, whatever methods you take – whether blocking part of a document out – the technology might vary but the ethical issues are the same.

Q2 follow up FOD) And how were those ethical issues worked out?

A2 – JM) Having a good editor is also essential.

A2 – CB) When I was at Thomson Reuters I was involved in running women’s rights surveys to collate data, and when you do that you need to apply research ethics, with advice from those appropriately positioned to give it.

A2 – RS) There is an issue that traditionally journalists are trained in ethics but data scientists are not. We have policies in terms of data privacy… But there is much more to do. And it comes down to the person who is building a data model – you have to be aware of the possible impact and implications of that model. And risks also of things like the Filter Bubble (Pariser 2011).

Q3 – JO) One thing that came through listening to ? and Jacqui: it’s become clear that data is a core part of journalism… You can’t get the story without the data. So, is there a competitive advantage to being able to extract that meaning from the data – is there a data science arms race here?

A3 – RS) I certainly look out to NYT and other papers I admire what they do, but of course the reality is messier than the final product… But there is some of this…

A3 – JM) I think that if you don’t engage with data then you aren’t keeping up with the field, you are doing yourself a professional disservice.

A3 – EH) There is a need to keep up. We are a relatively large group, but nothing like the scale of the NYT… So we need to find ways to tell stories that they won’t tell, or to have a real sense of what an Economist data story looks like. Our team is about 12 or 14, which is a pretty good size.

A3 – RS) Across all of our businesses there are 100s in data science roles, of whom only a dozen or so are on the data journalism side.

A3 – JM) At the BBC there are about 40 or 50 people on the visual journalism team. But there are many more in data science in other roles, people at the World Service. But we have maybe a dozen people in the lab at any given moment.

Q4) I was struck by the comment about legibility, and, a little bit related, transparency in data. Data is already telling a story; there is an editorial dimension, and that is added to in the presentation of the data… And I wonder how you can do that to improve transparency.

A4 – JM) There are many ways to do that… To show your process, to share your data (if appropriate). Many share code on GitHub. And there is a question there though – if someone finds something in the data set, what’s the feedback loop?

A4 – CB) In the past where I’ve worked we’ve shared a document on the step by step process used. I’m not a fan of sharing on GitHub, I think you need to hand hold the reader through the data story etc.

Q5) Given that journalism is about holding companies to account… In a world where, e.g., Google are the new power brokers, who will hold them to account? I think data journalism needs a merger of journalism, data science, and design… Sometimes that can be in one person… And what do you think about journalism playing a role in holding new power brokers to account?

A5 – EH) There is a lot of potential. These companies publish a lot of data and/or make their data available. There was some great work on FiveThirtyEight about Uber, based on a Freedom of Information request, to essentially fact check Uber’s own statistics and reporting of activities.

Q6) Over the years we (Robert Gordon University) have worked with journalists from various organisations. I’ve noticed that there is an issue, not yet raised, that journalists are always looking for a particular angle in data as they work with it… It can be hard to get an understanding from the data, rather than using the data to reinforce bias etc.

A6 – RS) If there is an issue of taking a data dump from e.g. Twitter to find a story… Well, dealing with that bias does come back to training. But yes, there is a risk of journalists getting excited, wanting to tell a novel story, without being checked by colleagues correcting the analysis.

A6 – CB) I’ve certainly had colleagues wanting data to substantiate the story, but it should be the other way around…

Q6) If you, for example, take the Scottish Referendum and the General Election and you see journalists so used to watching their dashboard and getting real time feedback, they use them for the stories rather than doing any real statistical analysis.

A6 – CB) That’s part of the reason for reading different papers and different reporters covering a topic – and you are expected to have an angle as a journalist.

A6 – EH) There’s nothing wrong with an angle or a hunch but you also need to use the expertise of colleagues and experts to check your own work and biases.

A6 – RS) There is a lot more to understand about how the data has come about; people often use the data set as a ground truth and that needs more thinking about. It’s somewhat taught in schools, but not enough.

A6 – JM) That makes me think of a data set called GDELT (?), which captures media reporting and enables event detection etc. I’ve seen stories of a journalist looking at that data as a canonical source for all that has happened – and that’s a misunderstanding of how that data set has been collected. It’s close to a canonical source for reporting, but that is different. So you certainly need to understand how the data has come about.

Comment – FOD) So, you are saying that we can think we are in the business of reporting fact rather than opinion but it isn’t that simple at all.

Q7) We have data science, is there scope for story science? A science and engineering of generating stories…

A7 – CB) I think we need a teamwork sort of approach to story telling… With coders, with analysts looking for the story… The reporters doing field reporting, and the data vis people making it all attractive and sexy. That’s an ideal scenario…

A7 – RS) There are companies doing automatic story generation – like Narrative Science etc. already, e.g. on Little League matches…

Q7 – comment) Is that good?

A7 – RS) Not necessarily… But it is happening…

A7 – JM) Maybe not, but it enables story telling at scale, and maybe that has some usefulness really.

Q8/Comment) There was a question about ethics, and the comment that nothing more was needed there, and the comment about legibility – and I think there is a conflict there. Statistical databases infer missing data from the data you have, making valid inferences that could shock people because they are not actually in the data (e.g. salary prediction). This reminded me of issues such as source protection, where you may not explicitly identify the source but that source could be inferred. So you need a complex understanding of statistics to understand that risk, and to do that practice appropriately.

A8 – CB) You do need to engage with the social sciences, and to properly understand what you are doing in terms of your statistical analysis, your p-values etc. There is more training taking place, but still more to do.

Q9 – FOD) I wanted to end by coming back to Howard’s introduction. How could ATI and Edinburgh help journalism?

A9 – JM) I think there are huge opportunities to help journalists make sense of large data sets, whether that is tools for reporting or analysis. There is one called Detector.io, for instance, that lets you map reporting; it is shutting down and I don’t know why. There are some real opportunities for new tools.

A9 – RS) I think there are areas in terms of curriculum – design, ethics, privacy, bias… Softer areas not always emphasised in conventional academic programmes, but at least as important as the scientific and engineering sides.

A9 – EH) I think generating data in areas where we don’t have it. At The Economist we look at China, Asia, Africa, where data is either deliberately obscured or they don’t have the infrastructure to collect it. So tools to generate that would be brilliant.

A9 – CB) Understand what you are doing; push for data being available; and ask us and push us to be accountable, and it will open up…

Q10) What about the readers? You’ve been saying the journalists have to understand their stats… But what about the readers, who know how to understand the difference between reading the Daily Mail and the Independent, say, but don’t have the data literacy to understand the data visualisation etc.?

A10 – JM) It’s a data literacy problem in general…

A10 – EH) Data scientists have the skills to find the information and raise awareness

A10 – CB) I do see more analytical reporting in the US than in Europe. But data isn’t there to obscure anything. But you have to explain what you have done in clear language.

Comment – FOD) It was once the case that data was scarce, and reporting was very much on the ground and on foot. But we are no longer hunter-gatherers in the same way… Data is abundant and we have to know how we can understand, process, and find the stories from that data. We don’t have clear ethical codes yet. And we need to have a better understanding of what is being produced. And most of the media most people consume is local media – city and regional papers – and they can’t yet afford to get into data journalism in a big way. Relevance is a really important quality. So my personal challenge to the ATI is: how do we make data journalism pay?

And we are back from lunch and some excellent demos… 

Ericsson, Broadcast & Media Services – Keynote from Steve Plunkett, CTO

Jon Oberlander is introducing Steve Plunkett who has a rich history of work in the media. 

I’m going to talk about data and audience research, and trends in audience data. We collect, aggregate and analyse lots of data, and that is where many of the opportunities are…

Ericsson has 24,000 people in R&D, very much focused on telecoms. Within R&D there is a group for broadcast and media services, and I joined as part of a buy-out of Red Bee Media. One part of these services is a metadata team who create synopses for EPGs across Europe (2,700 channels). We are also the biggest subtitlers in Europe. And we also do media management – with many hundreds of thousands of hours of audio and TV, that’s also an asset we can analyse (the inventory as well as the programme). And we operate TV channels – all the BBC channels, C4, C5, UKTV, channels in France, the Netherlands, and in the US – and our scheduling work is also a source of data. And we run recommendation engines embedded in TV guides and systems.

Now, before I talk about the trends I want to talk about the audience. Part of the challenge is understanding who the audience is… And audiences change, and the rate of change is accelerating. So I’ll show some trends in self-reported data from audiences on what they are watching. Before that, a quote from Reed Hastings of Netflix: “TV had a great 50 year run, but now its time is over”. TV is still where most impact and viewing hours are, but there are real changes now.

So, the Ericsson ConsumerLab Annual Report – participants across the world, 1,000 consumers across 20 countries. In-home interviews to understand their viewing context, what they are watching and what their preferences are. Of course self-reported behaviour isn’t the same as real data, but we can compare and understand that.

So, the role of services varies between generations. The go-to services are very different between older generations and younger generations. For older viewers it’s linear TV, then DVR, then play/catch-up, then YouTube etc. For younger generations SVOD is the top viewing service – things like Netflix, Amazon Prime etc.

In terms of daily media habits we see again a real difference between use of scheduled linear TV vs. streamed and recorded TV. Younger people are again much more likely to use streaming, older viewers scheduled TV much more. And we are seeing YouTube growing in importance – viewing of over 3 hours per day has increased hugely in the last 4 years, and it is used as a go-to space to learn new things (e.g. how to fix the dishwasher).

In terms of news, the importance of broadcast news increases with age – it is still much more important to older consumers. And programming wise, 45% of streamed on-demand viewing of long content is TV series. Many watch box sets, for instance. As broadcasters we have to respect that pattern of use; not all are linear scheduled viewers. And you see this in trends of tweeting – peaks of tweets about how quickly a newly released online series has been completed.

There is also a shift from fixed to mobile devices. TV screens and desktop PCs have seen a reduction in viewing hours and use, compared to mobile, tablet and laptop use. That’s a trend over time. And that’s again following generational lines… Younger people are more likely to use mobile. Now again, this is self-reported and can vary between countries. So in our broadcast planning, understanding content – length of content, degree of investment in high definition etc. – should be informed by those changes. On mobile, user generated content – including YouTube but also things like Periscope – is still dominant.

In terms of discovering and remembering content it is still the case that friends, reviews, trailers etc. matter. But recommendation engines are important and viewers are satisfied with them. For the last two years we’ve asked our study group about those recommendation engines: their accuracy; their uncanniness and data and privacy concerns; and the issue of shared devices. So there is still much more to be done. The scale of Netflix’s library is such that recommendations are essential to help users navigate.

So, that was self-reported. What about data we create and collect?

We have subtitle coverage, often doing the blanket subtitle coverage for broadcasters. We used to use transcribers and transcription machines. We then invested in respeaking technologies, and that’s what we use now – those respeakers clean up grammar etc., and the technology is trained for their voice. That process of logging subtitles includes very specific timestamps… That gives us rich new data, and also creates a transcript that can sit alongside the subtitles and programme. But it can take 6-7 hours to do subtitling as a whole process, including colour coding speakers etc. And we are looking to see what else subtitlers could add – mood perhaps? etc. – as part of this process.
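[A quick illustrative sketch of why those timestamps matter – hypothetical cue data of mine, not Ericsson’s tooling: once every cue pairs words with a time and a speaker, a searchable, time-coded transcript falls out almost for free:]

```python
# Hypothetical subtitle cues: (start_seconds, end_seconds, speaker, text).
cues = [
    (12.0, 14.5, "SPEAKER1", "Welcome back to the programme."),
    (14.6, 17.2, "SPEAKER2", "Tonight we look at the archives."),
]

def find(cues, term):
    """Return the start times at which a term is spoken."""
    return [start for start, _end, _spk, text in cues if term.lower() in text.lower()]

# A time-coded transcript, derived directly from the subtitle data.
transcript = "\n".join(f"[{start:7.2f}s] {spk}: {text}" for start, _end, spk, text in cues)
print(transcript)
print(find(cues, "archives"))  # -> [14.6]: jump straight to that moment
```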

We have a database of about 8.5 million records that include our programme summaries, images on an episode level, etc. And we are working on the system we use to manage this, to improve it.

I mentioned media management, and we do things like automated transcription – it wouldn’t be good enough for use in broadcast but…

Mediaroom – 60 telecom operators use it for IPTV, and it collects very granular data from TV viewing – all collected with consent. Similar for OTT. And similar platforms for EPG. Search queries. Recommendations, and whether they are acted upon. And we also have mobile network data – to understand drop-off rates, what’s viewed for a particular item etc.

We sit between the broadcaster and the audience, so our work feeds into broadcasters’ work. For insight – segmentation, commissioning, marketing, scheduling, sales… For personalisation – content recommendations, personalised channels that are unique to you, targeted advertising, search, content navigation, contextual awareness. One of the worst feedback comments we see is about delivery quality, so when it comes to delivery quality we apply our data to network optimisation etc.

In terms of the challenges we face, they include: consumer choice; data volumes – growing fast, so finding value matters; data diversity – very different in structure and form, so a complex task; expertise – there is a lack of skills embedded in these businesses to understand our data; timeliness – personal channels need fast decisions etc., and real time processing is a challenge; privacy – one of the biggest ones here, and the industry needs to know how to do that well. Our feedback on recommendation engines is such that we need to explain where data is coming from, to make that trusted.

In terms of opportunities: we are seeing evolving technology; cloud resources are changing this fast; investment – huge in this area at the moment; consumer appetite for this stuff; and we are in an innovation white space right now – we are in early days…

And finally… an experimental application. We took Made in Chelsea and added a graph over the viewing timeline showing tweets and peaks… and provided a navigation system based on tweets shared. And on the right-hand side, navigation by character, so you can follow their journey. We created some semantic visualisation tools for e.g. happy, sad, funny moments. Navigation that focuses on the viewer’s interests.

Audience Engagement Panel Session – Jon Oberlander (Moderator), University of Edinburgh

Jon is introducing his own interests in data science, in design informatics, and in linguistics and data science, with a particular mention for LitLong. Similarly, a colleague in Politics is analysing public interest in the UK and EU, and also reaction to political messages. And finally the Harmonium project at the Edinburgh International Festival – using music and data on musical performers to create a new music and visualisation project, with a 20k in-person audience, and researchers monitoring and researching that audience on the night too…

Pedro Cosa – Data Insights and Analytics Lead, Channel 4

I’m here to talk a bit about the story of Channel 4 and data. Channel 4 is a real pioneer in using data in the UK, and in Europe. You’ve all heard Steve’s presentation on changing trends – these are very relevant for Channel 4 as we are a public service broadcaster, but also because our audience is particularly young and affluent. They are changing their habits quickly, and that matters from an audience and also an advertising perspective for us. Senior management was really pushing for change in the channel. Our CEO has said publicly that data is the new oil of the TV industry and he has invested in data insights for Channel 4. The challenge is to capture as much data as possible, and feed that back to the business. So we use registration data from All 4 (was 4OD) – to use that site you have to register. We have 13 million people registered that way, so that’s already capturing details on half our target audience in the UK. And that moves us from one-to-many to one-to-one. And we can use that for targeted advertising, which comes with a premium paid by advertisers, and to really personalise the experience. So that’s what we are doing at the moment.

Hew Bruce-Gardyne – Chief Technology Officer, TV Squared

We are a small company working on data analytics for use by advertisers, which in turn feeds back into content. My personal background is as an engineer; the big data side of number crunching is where I come from. From where I am sitting, audience engagement is a really interesting problem… A really big engaging programme seems to kill the advertising; so replays, catch-up, and the opportunities they open up are, for us, gold dust.

Paul Gilooly – Director of Emerging Products, MTG (Modern Times Group)

MTG is a Scandinavian pan-European broadcaster; we have the main sports and Hollywood rights as well as major free-to-air channels in the Scandinavian countries. And we run Viaplay, which is an SVOD service like (and predating) Netflix. The Nordics are interesting as we have high speed internet, affluent viewers, and markets where Apple TV is significant, disproportionately compared to the rest of Europe. So when I think of TV I think of a subscribing audience, and Pay TV. And my concern is churn – a more engaged customer is more likely to stick around. So any way to increase engagement is of interest, and data is a key part of that. Just as Channel 4 are looking at authentication as a data starting point, so are we. And we also want to encourage behaviours like recommendations of products and sharing. And some behaviours to discourage. And data is also the tool to help you understand behaviours you want to discourage.

For us we want to increase transactions with viewers, to think more like a merchandiser, to improve personalisation… So back to the role of data – it is a way to give us a competitive advantage over competitors, can drive business models for different types of consumer. It’s a way to understand user experience, quality of user experience, and the building of personalised experiences. And the big challenge for me is that in the Nordics we compete with Netflix, with HBO (has direct to air offering there). But we are also competing with Microsoft, Google, etc. We are up against a whole new range of competitors who really understand data, and what you can do with data.

Steve Plunkett – CTO, Broadcast & Media Services, Ericsson

No intro… as we’ve just heard from you… 

Q&A

Q1 – JO) Why are recommendations in this sector so poor compared to e.g. Amazon?

A1 – SP) The problem is different. Amazon has this huge inventory, and collective recommendation works well. Our content is very different. We have large content libraries, and collective recommendation works differently. We used to have human curators programming content; they introduced serendipity, and recommendation engines are less good at that. We’ve just embarked on a 12 month project with three broadcasters to look at this. There is loads of research on public top 10s. One of the big issues is that if you get a bad recommendation it’s hard to say “I don’t like this” or “not now” – they just sit there and the feedback is poor… So important to solve. Netflix invested a great deal of money in recommendations: they offered $1 million for a recommender that would beat their own by 10%, and that took a long time to win. Data science is aligned with that of course.

A1 – PC) Recommendations are core for us too. But TV recommendations are so much more complex than retail… You need to look at the data and analyse it… You have to promote cleverly, to encourage discovery, to find new topics or areas of debate, things you want to surface in a relevant way. It’s an area C4, and also the BBC, are looking to develop.

A1 – HBG) There is a real difference between retail and broadcast – about what you do, but also about the range of content available… So even if you take a recommendation, it may not reflect true interest and buy-in to a product. That adds a layer of complexity and cloudiness…

A1 – SP) Tracking recommendations in a multi device, multi platform space is a real challenge… Often a one way exchange. Closing loop between recommendation and action is hard…

Q2 – JO) Of course you could ask active questions… Or could be mining other streams… How noisy is that, how useful is that? Does it bridge a gap.

A2 – SP) TV has really taken off on Twitter, but there is disproportionate noise based on a particular audience and demographic. That’s a useful tool though… You can track engagement with a show, at a point of time within a show… But not necessarily the recommendations for that viewer at that time… It is one of many data sets to use, though…

Q3 – JO) Are users engaging with your systems aware of how you use their data, are they comfortable with it?

A3 – PC) At C4 we have made a clear D-Word promise – with a great video from Alan Carr that explains that data use. You can understand how it is used, can delete your own data, can change your settings, and if you don’t use the platform for 2 years then we delete your data. A very clear way to tell the user that you are in control.

A3 – SP) We had a comment from someone in a study group who said they had been categorised by a big platform as a fan of 1980s supernatural horror and didn’t want to be categorised in that way, or for others to see this. So there is a real interest in transparency there.

A3 – PG) We aren’t as far ahead as Channel 4, they are leading the way on data and data privacy.

Q4 – JO) Who is leading the way here?

A4 – PG) I think David Abraham (C4) deserves great credit here; the CEO understands the importance of data science and its role in the core business model. And that competitors for revenue are Facebook, Google and so forth.

Q5 – JO) So, trend is to video on demand… Is it also people watching more?

A5 – SP) It has increased but much more fragmented across broadcast, SVOD, UGC etc. and every type of media has to define its space. So YouTube etc. is eating into scheduled programming. For my 9 year old child the streaming video, YouTube etc. is her television. We are competing with a different set of producers.

A5 – PG) The issue isn’t that linear channels do not allow you to collect data. If you have to login to access content (i.e. Pay TV) then you can track all of that sort of data. So DR1, Danish TV channel and producer of The Killing etc. is recording a huge drop in linear viewing by young people, but linear still has a role for live events, sport etc.

A5 – HBG) We do see trends that are changing… Bingeathons are happening and that indicates not a shortness of attention but a genuine change. Watching a full box set is the very best audience engagement. But if you are at a kitchen table, on a device, that’s not what you’ll be watching… It will be short videos, YouTube etc.

To come back to the privacy piece I was at a conference talking about the push to ID cards and the large move to restrict what people can know about us… We may lose some of the benefits of what can be done. And on some data – e.g. Medical Informatics – there is real value that can be extracted there. We know that Google knows all about us… But if our TV knows all about us that’s somehow culturally different.

Q6) Piracy is very high, especially at younger age ranges, so what analysis have you done on that?

A6) Not a huge amount on that, and this is self-reported. But we know piracy drops where catch-up, and longer catch-up windows, are available – it seems content is viewed legitimately when it can be.

Q6 – follow up) Piracy seems essentially like product failure, and how do you win back your viewers and consumers.

A6 – HBG) A while back I saw a YouTube clip of the user experience of pirated film versus DVD… In that case the pirated film was easier, versus the trailers, reminders not to pirate etc. on the DVD. That’s your product problem. But as we move to subscription channels etc. When you make it easy, that’s a lot better. If you try to put barriers up, people try to find a way around it….

A6 – PG) Sweden has a large piracy issue. The way you compete is to deliver a great product and user experience, and couple that with content unique to your channel – premium sports, for example – so pirates can’t meet all the needs of the consumer. But also be realistic with the price point.

A6 – HBG) There is a subtle difference between what you consume – e.g. film versus TV. But from music we know that pirating in the music industry is not a threat – that those are also purchasing consumers. And when content creators work with that, and allow some of that to happen, that creates engagement that helps. Most successful brand owners let others play with their brand.

A6 – PC) Piracy is an issue… But we even use piracy data sources for data analysis – using BitTorrent to understand the popularity of shows in other places, to predict how popular they will be in the UK.

Comment – JO) So, pirates are data producers?

A6 – PC) Yes, and for scheduling too.

Q7) How are you dealing with cross channel or cross platform data – to work with Google or Amazon say. I don’t see much of that with linear TV. Maybe a bit with SVOD. How are mainstream broadcasters challenging that?

A7 – PC) Cross platform can mean different things. It may be Video On Demand as well as broadcast on their TV. We can’t assume they are different, and should look to understand what the connections are there… We are so conscious and cautious of using third party data… But we can do some content matching – e.g. advertiser customer base, and much more personalised. A real link between publisher and advertiser.

Q7 follow up) Would customer know that is taking place?

A7 – PC) It is an option at sign up. Many say “yes” to that question.

A7 – PG) We still have a lot to do to track the consumer across platforms, so a viewer can pick up consuming content from one platform to another. This technology is pretty immature, an issue with recommendation engines too.

A7 – SP) We do have relationships with third party data companies that augment what we collect – different from what a broadcaster would do. For this it tends to be non-identifiable… But you have to trust the analyst to have combined data appropriately. You have to understand their method and process, but usually they have to infer from the data anyway, as they usually don’t have the source.

Q8 – JO) We were talking about unreliable technologies and opportunities… So, where do you see wearable technologies perhaps?

A8 – SP) We did some work using facial recognition to understand the usefulness of recommendations. That was interesting but deploying that comes with a lot of privacy issues. And devices etc. also would raise those issues.

A8 – PC) We aren’t looking at that sort of data… But data like weather matters for this industry – local events, traffic information – as context for consumption etc. That is all being considered as context for analysis. But we also share our data science with creative colleagues – technology will tell you, say, when content is performed or shown, but there is a subjective human aspect that they want to see, to dissect elements of content so the machine can really learn… So is there sex involved… Who is the director, who is the actress… So many things you can put in the system to find this stuff out. Forecasting really is important in this industry.

A8 – HBG) The human element is interesting. Serendipity is interesting. From a neuroscience point of view I always worry about the act of measurement… We see all the time that the same audience, the same demographic, watching the same content, react totally differently at different times of day etc. And live vs catch-up, say. My fear, and a great challenge, is how to get a neuroscience experiment valid in that context.

Q9 – from me) What happens if the data is not there in terms of content, or recommendation engines – if the data you have tells you there is a need for something you don’t currently have available. Are you using data science to inform production or content creation, or for advertising?

A9 – SP) The research we are currently doing is looking at ways to get much better data from viewers – trying things like a Tinder-like playful interface to really get a better understanding of what users want. But we also, whenever there are searches, capture not only what is available on that platform but also what is in demand but not yet available, and we provide details of those search misses to commissioning teams to inform what they do.

A9 – PG) There are some interesting questions about what is most valuable… So, you see Amazon Prime deciding on the value of Jeremy Clarkson and the Top Gear team… And I think you will increasingly see purchasing based on data. And when it comes to commissioning we are looking to understand gaps in our portfolio.

A9 – PC) We are definitely interested in that. VOD is a proactive thing… You choose as a viewer… So we have an idea of micro-genres that are specific to you… So we have, say, Sex/Pervert corner; we have teenage American comedy; etc., and you can see how micro-genres are panning out… And you can then tell commissioners what is happening on the video-on-demand side… But that’s different to commissioning for TV, and convincing that…

A9 – HBG) I think that you’ve asked the single greatest question at a data science conference: what do you do if the data is not there? Sometimes you have to take a big leap to do something the data can’t predict… And that happens when you have to go beyond the possibilities of the data, and just get out there and do it.

A9 – SP) The concern is such that the data may start to reduce those leaps and big risks, and that could be a concern.

JO) And that’s a great point to finish on: that no matter how good the data science, we have to look beyond the data.

And after a break we are back… 

BBC – Keynote from Michael Satterthwaite, Senior Product Manager

I am senior product manager on a project called BBC Rewind. We have three projects looking at opportunities, especially around speech-to-text: BBC Monitoring, BBC Rewind, and BBC News Labs. BBC Rewind is about maximising value from the BBC archive. But what does “value” mean? Well, it can be about money, but I’m much more interested in the other options around value… Can we tell stories, can we use our content to improve people’s health… These are high level aims, but we are working with the NHS and dementia organisations, and running a hack event in Glasgow later this month with the NHS, Dementia UK, Dementia Scotland etc. We are wondering if there is any way that we can make someone’s life better…

So, how valued is the BBC’s Archive? I’m told it’s immeasurable but what does that mean? We have content in a range of physical locations some managed by us, some by partners. But is that all valuable if it’s just locked away? What we’ve decided to do to ensure we do get value, is to see how we can extract that value.

So, my young niece, before she was 2, had worked out how to get into her mum’s iPad… And her dad works a lot in China, and has an iPhone. In an important meeting he got loads of alerts… Turns out she’d worked out how to take photos of the ceiling and send them to him… How does this relate? Well, my brother-in-law didn’t delete those pictures… And how many of us do delete our photos? [quick poll of the room: very very few delete/curate their digital images]

Storage has got so cheap that we have no need to delete. But at the BBC we used to record over content because of the costs of maintaining it. That reflected the high price of storage – the episodes of Doctor Who taped over to use for other things. That was a decision for an editor. But the price of storage has dropped so far that we can, in theory, keep everything, from programmes to scripts and script notes, transcripts etc. That’s hard to look through now. Traditionally the solution is humans generating metadata about the content. But as we are now cash-strapped and there is so much content… is that sustainable?

So, what about machines – and here’s my Early Learning Centre bit on Machine Learning… It involves a lot of pictures of pandas and a very confused room… to demonstrate a Panda and Not a Panda. When I do this presentation to colleagues in production they see shiny demos of software but don’t understand what the realistic expectations of that machine are. Humans are great at new things and intelligence, new problems and things like that…

Now part two of the demo… some complex maths… Computers are great at scale, at big problems. There is an Alan Turing quote here that seems pertinent, about it not being machines or humans, it’s finding ways for both to work together. And that means thinking about what machines are good at – things like initial classification, scale, etc. What are humans good at? Things like classifying the most emotional moment in a talk. And we also need to think about how best we can use machines to complement humans.

But we also need to think about: how good is good enough? If you are doing transcripts of an hour-long programme, you want 100% accuracy, or close to it, finishing with humans. But if you are finding a moment in a piece of spoken word, you just need to find the appropriate words for that search. That means your transcript might be very iffy, but that’s fine as long as it’s good enough to find those key entities. We can spend loads of time and money getting something perfect, when there is much more value in getting work to a level that is good enough to do something useful and productive.

This brings me to BBC Rewind. The goal of this project is to maximise the value from the BBC Archives. We already have a lot of digitised content for lots of reasons – often to do with tape formats dying out and the need to build new proxies. And we are doing more digitising of selected parts of the BBC Archives. And we are using a mixture of innovative human and computer approaches to enrichment. And looking at new ways to use archives in our storytelling for audiences.

One idea we’ve tried is BBC Your Story which creates a biography based on your own life story, through BBC Archive content. It is incredibly successful as a prototype but we are looking at how we can put that into production, and make that more personalised.

We’ve also done some work on Timeline, and we wanted to try out semantic connections etc. but we don’t have all our content marked up as we would need so we did some hand mark up to try the idea out. My vision is that we want to reach a time when we can search for:

“Vladimir Putin unhappily shaking hands with Western Leaders in the rain at the G8, whilst expressing his happiness.” 

So we could break that into many parts, requiring lots of complex mark-up of content to locate suitable footage.
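[To make the decomposition concrete, here is a toy sketch – entirely my own invented annotations, not BBC Rewind’s data model – of how such a query might become a conjunction of simple facet filters over annotated shots:]

```python
# Hypothetical archive annotations: the complex query becomes a set of
# simple facet checks over each shot's metadata.
shots = [
    {"people": {"Vladimir Putin", "Western leaders"}, "action": "handshake",
     "weather": "rain", "event": "G8", "expression": "unhappy"},
    {"people": {"Vladimir Putin"}, "action": "speech",
     "weather": "sun", "event": "G20", "expression": "happy"},
]

query = {"action": "handshake", "weather": "rain",
         "event": "G8", "expression": "unhappy"}

def matches(shot, query):
    """A shot matches if every queried facet agrees with its annotation."""
    return all(shot.get(facet) == value for facet, value in query.items())

results = [s for s in shots if matches(s, query) and "Vladimir Putin" in s["people"]]
print(len(results))  # -> 1; the hard part is producing the annotations at scale
```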

At the moment BBC Rewind includes: speech-to-text in English based on the Kaldi toolset – it’s maybe 45% accurate off the shelf, but that’s 45% more of the words than you had before, with a confidence value; speech-to-text in the Welsh language; voice identification; speaker segmentation – speech recognition that identifies speakers is nice, but we don’t need that just yet, and even if we did we don’t need that person to be named (a human can tag that easily), and we can then train algorithms off that; face recognition – it is good but hard to scale, and we’ve been doing some work with Oxford University in that area. And then we get to context… Brian Cox versus (Dr) Brian Cox can be disentangled with some basic contextual information.
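[The “good enough” point about a rough transcript with confidence values can be illustrated with a small sketch – hypothetical ASR output of my own, not the Rewind pipeline:]

```python
# Hypothetical ASR output: (word, start_seconds, confidence) tuples.
asr_output = [
    ("the", 0.0, 0.95), ("chancellor", 0.4, 0.81), ("visited", 1.0, 0.42),
    ("edinburgh", 1.5, 0.88), ("castle", 2.1, 0.35), ("today", 2.5, 0.90),
]

CONFIDENCE_THRESHOLD = 0.6  # search can tolerate far less than broadcast

def build_index(words, threshold=CONFIDENCE_THRESHOLD):
    """Index only confidently-recognised words, each with its timestamps."""
    index = {}
    for word, start, conf in words:
        if conf >= threshold:
            index.setdefault(word, []).append(start)
    return index

index = build_index(asr_output)
print(index.get("edinburgh"))  # -> [1.5]: enough to jump to the moment
print(index.get("castle"))     # -> None: too uncertain to index, and that's fine
```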

Finally, we have an exciting announcement. BBC Monitoring is a great example of how we can use machines to help human beings in their monitoring of media. So we will be creating tools to enable monitoring of media. In this project the BBC is partnering with the University of Edinburgh, UCL, Deutsche Welle and others in an EU funded Horizon 2020 project called SUMMA – this project has four workstreams and we are keen to make new partnerships.

The BBC now runs tech hack events which have resulted in new collaborations – including SUMMA – and more hack events are coming soon, so contact Susanne Weber, Language Technology Producer in BBC News Labs. The first SUMMA hack event will be at the end of next year and will focus on the automated monitoring of multimedia sources: audio-visual, text etc.

Let's try stuff faster and work out what works – and what doesn't – more quickly!

Unlocking Value from Media Panel Session – Moderator: Simon King, University of Edinburgh

Our panel is…

Michael Satterthwaite – Senior Product Manager, BBC
Adam Farquhar – Head of Digital Scholarship, British Library
Gary Kazantsev – R&D Machine Learning Group, Bloomberg
Richard Callison – brightsolid (DC Thomson and Scottish Power joint initiative)

Q1 – SK) Let's start with that question of what value might be, if not financial?

A1 – GK) Market transparency, business information – there are quantitative measures for some of these things. But it's a very hard problem in general.

A1 – AF) We do a lot of work on value in the UK, and economic impact, but we also did some work a few years back sharing digitised resources onto Flickr, and that generated huge excitement and interest. That's a great example of where you can create value by being open, rather than monetising early on.

A1 – MS) Understanding value is really interesting. Getty uses search to aid discovery, and they have learned to use the data they capture to ensure users access what they want – and want to buy – quickly. For us, with limited resources, the best way to understand value and impact is to try things out a bit, to see what works and what happens.

A1 – AF) Putting stuff out there without much metadata can give you some really great crowd data. From a million images we shared, our crowd identified the maps among those materials. And that work was followed up by georeferencing those maps on the globe. So, even if you think there couldn't possibly be enough of a community interested in doing this stuff, you can find that there really is that interest, and people who want to help…

A1 – MS) And you can use that to prioritise what you do next, what you digitise next, etc.

Q2 – SK) Which of the various formats of media are most difficult to work with?

A2 – MS) Images are relatively straightforward, but video is essentially 25 pictures per second… That's a lot of content… That means sampling content, else we'd crash even Amazon with the scale of work we have. And that sampling allows you to understand time, the aspect that makes video so tricky.
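(A rough sketch of the sampling idea using OpenCV: rather than analysing all 25 frames per second, keep roughly one frame per second, each with its timestamp so the time dimension is preserved. The sampling rate is an illustrative assumption, not what the BBC actually uses.)

```python
import cv2  # pip install opencv-python

def sample_frames(path, frames_per_second=1.0):
    """Yield roughly `frames_per_second` (timestamp, frame) pairs.

    Analysing every frame of ~25fps broadcast video is prohibitively
    expensive, so enrichment pipelines typically subsample.
    """
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(native_fps / frames_per_second), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield index / native_fps, frame  # seconds into the programme
        index += 1
    cap.release()
```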

Q3 – SK) Is there a big difference between archive and current data…

A3 – RC) For me the value of content is often about extracting value from very local context. And it leads back to several things said earlier, about perhaps taking a leap of faith into areas the data doesn't show, and which could be useful in the future… So we've worked with handwritten data – the only Census that was all handwritten, 32m rows of records on England and Wales – and had to translate that to text… We went offshore and outsourced to a BPO… That was purely a commercial project, as we knew there was historical and genealogical interest… But there are not so many data sets like that around.

But working with the British Library we've done digitisation of newspapers, both from originals and from microfilm. OCR isn't perfect but it gets the content out there… The increase in multimedia online triggered by broadcast – Who Do You Think You Are? – drives huge interest in these services, and we were in the right place at the right time to make that work.

A3 – GK) We are in an interesting position, as Bloomberg creates its own data but we also ingest more than 1 million news documents in 30 languages from 120k sources. The Bloomberg newsroom started in 1990 and had the foresight to collect clean, clear digital data from the beginning. That's great for access, but extracting data is different… There are issues like semantic mark-up and entity disambiguation… And huge issues of point-in-time correctness – named entities change meaning over time, and unless someone encoded that into the information, it is very difficult to disambiguate. And the value of this data – its role in trading etc. – needs to be reliable.
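(A minimal sketch of point-in-time correctness: the same name can resolve to different entities depending on the document's date. The table below is entirely hypothetical – this is not Bloomberg's system.)

```python
from bisect import bisect_right
from datetime import date

# Hypothetical point-in-time entity table: a name maps to different
# referents depending on when the document was written.
NAME_HISTORY = {
    "Google": [
        (date(1998, 9, 4),  "Google Inc."),
        (date(2015, 10, 2), "Google LLC, subsidiary of Alphabet Inc."),
    ],
}

def resolve(name: str, as_of: date) -> str:
    """Resolve a name to the entity it referred to on a given date."""
    history = NAME_HISTORY.get(name, [])
    dates = [d for d, _ in history]
    i = bisect_right(dates, as_of)
    if i == 0:
        raise LookupError(f"{name!r} had no known referent on {as_of}")
    return history[i - 1][1]

print(resolve("Google", date(2010, 1, 1)))  # "Google Inc."
```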

I kind of don't recognise Mike's comments on video, as there is object recognition available as an option… But I think we get more value out of text than most people, and we get real value from audio. Transcription and beyond… entity recognition, dialogue structure, event extraction… There's a fairly long NLP pipeline there…

A3 – AF) The things you want to identify are very similar to the things we want in the humanities, and they have additional benefit for journalists too. Is text search enough? Not really. It's an interesting way in… But text isn't the best way to understand historical images across a range of books, and it isn't that useful in the context of the UK Web Archive and the images in that. Much of what may be of interest is not the text, and is perhaps better reduced to a series of shapes etc.

Q4) There has been a mention of crowdsourcing already and I was wondering about that experience – what worked and what did not – and, thinking back to Mike's presentation, what might work better?

A4 – AF) We found that smaller batches worked better… People love to see progress, and like to have a sense of accomplishment. We found rewards were nice – we offered lunch with the head of maps at the British Library and that was important. Also, mix it up – don't pose the same super hard problems all the time.

A4 – MS) I was going to give the BL example of your games machine… A mix of crowdsourcing and gamification.

A4 – AF) It's very experimental but, as mentioned in the earlier panel session about the Tinder-like app, we've worked with Adam Crymble to build an arcade game to do image classification, and we are interested to see if people will use their time differently with this device. Will they classify images and help us build up our training sets? The idea is that it's engagement away from desktops or laptops…

A4 – RC) We have tried crowdsourcing for corrections. Our services tend to be subscription and pay-as-you-go, but people still see value in contributing. And you can incentivise that. And you see examples across the world where government websites are using crowdsourcing for transcription.

A4 – GK) You could argue that we were innovators in crowdsourcing at Bloomberg, through blogs etc., and through tagging of entities. What we have learned from crowdsourcing is that it isn't good for everything. It's hard when specialist knowledge is needed, or when specific languages are needed – it's hard to get people to tag in Japanese. We aren't opposed to paying for contributions, but you have to set it up effectively. We found you have to define tasks very specifically, for instance.

Q5) Talking about transposing everything to text implies that that is really possible. If we can't describe images effectively with text, then what else should we be doing? I was wondering what the panel thought in terms of modalities of data…

A5 – MS) Whatever we do to mark up content is only as good as our current tools, understanding and modalities – and we'd want to go back and mark it up differently later. In Google you can search for an image with an image… It's changed over time… Now it uses text on the page to gather context, and presents that, as well as the image, back to you… If you can store a fingerprint you can compare it to others… We are doing visual searches, searches that are not text-based. Some of these things already exist and they will get better and better. And the ability to scale and respond will be where the money is.
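(A small sketch of the "fingerprint" idea using a classic average hash – one simple way to compare images without any text. This is a generic technique, not necessarily what Getty or Google actually use.)

```python
from PIL import Image  # pip install pillow

def average_hash(path, size=8):
    """A classic 64-bit perceptual fingerprint (average hash).

    Shrink to size x size greyscale, then record which pixels sit above
    the mean brightness. Similar-looking images get similar bit patterns.
    """
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def distance(h1, h2):
    """Hamming distance between two fingerprints; small means similar."""
    return bin(h1 ^ h2).count("1")

# Usage: distance(average_hash("a.jpg"), average_hash("b.jpg")) < 10
# suggests the two images are near-duplicates.
```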

Q6) The discussion is quite interesting, as at the moment it's about value you define… But you could see the BBC as some form of commons… It could be useful for local value, for decision making, etc., where you are not in a position to declare the value… And there are lots of types of value out there, particularly in a global market.

A6 – MS) The BBC has various rules and regulations about publishing media, one of which is that humans always have to check content, and that is a real restriction on scale, particularly as we are looking to reduce staff. We ran an initiative called MCB with the University of Edinburgh that opened up some of these ideas. But ideally we would have every single minute of broadcast TV and radio in the public domain… But we don't have the rights to everything… In many cases we acquired content before digital, which means you need to renegotiate content licences etc. before digitising.

A6 – AF) Licences can be an issue; privacy and data protection can be an issue. But we also have the challenge of how we meet user needs, and of actually listening to those needs. Sometimes we have to feel comfortable providing a lower-level service, one that may require higher skills (e.g. coding) to use… That can be something wonderful – not everything has to be a super polished service, as long as it is useful and valuable. And things will change in terms of what is useful, what is possible, etc.

A6 – GK) For us it's an interesting question. Our users won't say what they want, so you have to reverse engineer, then do rapid product development… So we do what you (Michael) suggest – building rapid prototypes to try ideas out. But this isn't just a volatile time, it's a volatile decade, or more!

Q7) Can you tell us anything about how you manage the funnel for production, and how context is baked in during the content creation process…

A7 – GK) There is a whole toolset for creating and encoding metadata, and for doing so in a way that is meaningful to people beyond the organisation… But I could talk about that for an hour, so better to discuss it later I think.

Q8 – SK) How multilingual do you actually need to be in your work?

A8 – GK) We currently ingest content in 34 languages, but 10 languages cover the majority – and things change quickly. It used to be that 90% of content ingested was in English; now it's 70-80%. That's a shift… We have not yet seen a case where lots of data suddenly appears in a language where there was previously none. Instead we see particularly well resourced languages. Japanese is a large, well resourced language with many resources in place, but it is very tricky from a computational perspective. And that can mean you still need humans.

A8 – MS) I probably have a different perspective on languages… We have BBC Research working in Africa with communities just going online for the first time. There are hundreds of languages in Africa, but none of them will be a huge language online… There are a few approaches… You can either translate directly, or you can convert into English and then translate from there. Some use speech-to-text, with a Stephen Hawking-type synthesised voice to provide continuity.

A8 – AF) Our collections cover all languages at all times… an increasingly difficult challenge.

Comment – Susanne, BBC) I wanted to comment on speed of access to different languages. All it takes is a catastrophe like an Ebola outbreak… or a disaster in Ukraine, or in Turkey… and you suddenly have the use case for ASR and machine translation. And you see audience expectations there.

A8 – MS) And you could put £1M into many languages and make little impact… But if you put that into one key language, e.g. Pashtu, you might have more impact… We need to consider that in our funding and prioritisation.

A8 – GK) Yes, one disaster or event can make a big difference… If you provide the tools for people to access information and add their own typing of their language… In the case of, say, Ebola you needed doctors speaking the language of the patient… But I'm not sure there is a technological solution. Similarly a case on the Amazon… Technology cannot always help here.

Q9) Do you have concerns that translations might be interpreted in different contexts and be misinterpreted? There's the potential to get things massively wrong in another language. Do you have systems (human or machine) to deal with that?

A9 – AF) I won't quite answer your question, but a related thing… In some sense that's the problem of data… Data becomes authoritative unless we make it accessible, cite it, and explain how it came about. So we have large data collections being made available – BBC, BL etc. – and they can be examined in a huge set of new ways… They require different habits, tools and approaches than many of us are used to using – different tools than, say, academics in the humanities are used to. And we need to emphasise the importance of proper citing, sharing, describing etc.

A9 – MS) I'd absolutely agree about transparency. Another of Susanne's projects, Babel, gives a rough translation that can then be amended. But an understanding of the context is so important.

A9 – GK) We had a query last week, in German, for something from Der Spiegel… It got translated to The Mirror… But there is a news source called The Mirror… So the translation makes sense… except that you need outside data to be able to make sense of this stuff… It's really an open question where that knowledge should live and how you would do that.
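(A hedged sketch of one common workaround: mask known source names before machine translation and restore them afterwards, so "Der Spiegel" cannot come back as "The Mirror". The translate function and the gazetteer here are assumptions for illustration, not anyone's production system.)

```python
# Hypothetical gazetteer of source names that must survive translation.
PROTECTED_NAMES = ["Der Spiegel", "Die Welt", "Le Monde"]

def translate_preserving_names(text, translate):
    """Mask known source names before MT, restore them afterwards.

    `translate` is any text-to-text machine translation function (an
    assumption here); without this step an MT system may happily turn
    the publication "Der Spiegel" into "The Mirror" - a different
    news source altogether.
    """
    masked = {}
    for i, name in enumerate(PROTECTED_NAMES):
        token = f"__ENT{i}__"
        if name in text:
            text = text.replace(name, token)
            masked[token] = name
    translated = translate(text)
    for token, name in masked.items():
        translated = translated.replace(token, name)
    return translated
```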

Q10 – SK) So, a final question: What should ATI do in this space?

A10 – RC) For us, we'd like to see what can be done at an SME level, and some product taken to market…

A10 – GK) I think there are quite a lot of things that the ATI can do… I think there is a lot of stuff industry won't beat you to – the world is changing too rapidly for that. I think the University, the ATI, should be better connected to industry – and I'll talk about that tomorrow.

A10 – AF) As a national institution we have a lot of data and content; the question is how we can make sense of that large collection. The second issue is skills – there is a lot to learn about data and about working with large data collections. And thirdly there is convening… data and content, technologists, and researchers with questions to ask of the data – I think ATI can be really effective in bringing those people together.

A10 – MS) We were at an ideas hack day at the British Library a few weeks back, and that was a great opportunity to bring together the people who create data, the people who research, etc. I think ATI should be the holder of best practice, connecting the holders of content, academia, etc. to work together to add value. For me, trying to independently add value where it counts really makes a difference. For instance, we are doing some Welsh speech-to-text work which I'm keen to share with others in some way…

SK: Is there anything else that anyone here wants to add to the ATI to-do list?

Comment: I want to see us get so much better at multilingual support – the Babel fish for all spoken languages, ideally!

 

Closing Remarks – Steve Renals, Informatics, University of Edinburgh

I think today has been something of a kick-off for building relationships, and we've seen some great opportunities. There will be more opportunity to continue this over drinks as we finish for today.

And with that we are basically done, save for a request to hand in our badges in exchange for a mug – emblazoned with an Eduardo Paolozzi image inspired by a biography of Alan Turing – in honour of Turing's unusual attachment to his own mug (which used to be chained to the radiator!).