Oct 13 2015
 
Michael Dewar, Data Scientist at The New York Times, presenting at the Data Science for Media Summit held by the Alan Turing Institute and University of Edinburgh, 14th October 2015.

Today I am at the “Data Science for Media Summit” hosted by The Alan Turing Institute & University of Edinburgh, taking place at the Informatics Forum in Edinburgh. This promises to be an event exploring data science opportunities within the media sector, and the attendees are already proving to be a diverse mix of media, researchers, and others interested in media collaborations. I’ll be liveblogging all day – the usual caveats apply – but you can also follow the tweets on #TuringSummit.

Introduction – Steve Renals, Informatics

I’m very happy to welcome you all to this data science for media summit, and I just wanted to explain the idea of a “summit”. This is one of a series of events from the Alan Turing Institute, taking place across the UK, to spark new ideas, new collaborations, and build connections – researchers understanding areas of interest for the media industry, and the media industry understanding what’s possible in research. This is a big week for data science in Edinburgh, as we also have our doctoral training centre, so you’ll see displays in the forum from our doctoral students.

So, I’d now like to hand over to Howard Covington, Chair, Alan Turing Institute.

Introduction to the Alan Turing Institute (ATI) – Howard Covington, Chair, ATI

To introduce ATI I’m just going to cut to our mission: to make the UK the world leader in data science and data systems.

ATI came about from a government announcement in March 2014, then a bidding process leading to the universities being chosen in Jan 2015, and a joint venture agreement between the partners (Cambridge, Edinburgh, Oxford, UCL, Warwick) in March 2015. Andrew Blake, the institute’s director, takes up his post this week; he was previously head of research for Microsoft R&D in the UK.

Those partners already have about 600 data scientists working for them, and we expect ATI to be an organisation of around 700 data scientists as students etc. come in. The idea of the data summits – there are about 10 around the UK – is for you to tell us your concerns, your interests. We are also hosting academic research sessions for researchers to propose their ideas.

Now, I’ve worked in a few start-ups in my time and this is going at pretty much as fast a pace as you can go.

We will be building our own building, behind the British Library opposite the Francis Crick Institute, and there will be space at that HQ for 150 people. There is £67m of committed funding for the first 5 years – from companies and organisations with a deep interest who are committing time and resources to the institute.

The Institute sits in a wider ecosystem that includes: Lloyd’s Register – our first partner, who see huge amounts of data coming from sensors on large structures; GCHQ – working with them on the open stuff they do, and using their knowledge in keeping data safe and secure; and EPSRC – a shareholder and partner in the work. We also expect other partners to come in from various areas, including the media.

So, how will we go forward with the Institute? Well, we want to do both theory and impact. So we want major theoretical advances, but we will devote equal time to practical, impactful work. Maths and computer science are both core, but we want to be a broad organisation across the full range of data science, reflecting that we are a national centre. But we will have to take a specific interest in particular areas. There will be an ecosystem of partners. And we will have a huge training programme, with around 40 PhD students per year, and we want those people to go out into the world to take data science forward.

Now, the main task of our new director is working out our science and innovation strategy. He’s starting by understanding where our talents and expertise already sit across our partners. We are also looking at the needs of our strategic partners, and then the needs emerging from the data summits and the academic workshops. We should then soon have our strategy in place. But this will be additive over time.

When you ask someone what data science is, the definition is ever changing and variable. So I have a slide here that breaks the rules of slide presentations really, in that it’s very busy… But data science is very busy. So we will be looking at work in this space, and going into more depth, for instance on financial sector credit scoring, predictive models in precision agriculture, etc. Underlying all of these are similarities that cross many fields. Security and privacy is one such area – we can only go as far as it is appropriate to go with people’s data, an issue both for ATI and for individuals.

I don’t know if you think that’s exciting, but I think it’s remarkably exciting!

We have about 10 employees now, we’ll have about 150 this time next year, and I hope we’ll have opportunity to work with all of you on what is just about the most exciting project going on in the UK at the moment.

And now to our first speaker…

New York Times Labs – Keynote from Mike Dewar, Data Scientist

I’m going to talk a bit about values, and about the importance of understanding the context of what it is we do. And how we embed what we think is important into the code that we write, the systems that we design and the work that we do.

Now, the last time I was in Edinburgh, in 2009, I was doing a postdoc working on modelling biological data, based on video of flies. There was loads of data, a mix of disciplines, and we were market focused – the project became a data analytics company. And, like much other data science, it was really rather invasive – I knew huge amounts about the sex life of fruit flies, far more than one should ever need to! We were predicting behaviours, understanding correlations between environment and behaviour.

I now work at New York Times R&D and our task is to look 3-5 years ahead of current NYT practice. We have several technologists there, but also colleagues who are really designers. That has stretched me a bit… I am a classically trained engineer – go out into the world, find the problem, and then solve it with some solution, some algorithm that minimises a cost function. But it turns out in media, where we see decreasing ad revenue and increasing subscription, that we need to do more than minimise a cost function… That basically leads to click bait. So I’m going to talk about three values that I think we should be thinking about, and projects within each area. So, I shall start with Trust…

Trust

It can be easy to forget that much of what we do in journalism is essentially surveillance, so it is crucial that we do our work in a trustworthy way.

So the first thing I want to talk about is a tool called Curriculum, a Chrome browser plug-in that observes everything I read online at work. It takes chunks of text, aggregates them with what others are reading, and projects that onto a screen in the office. So, firstly, the negative… I am very aware I’m being observed – it’s very invasive – and that layer of privacy is gone; that shapes what I do (and it ruins Christmas!). But it also shares what everyone is doing, a sense of what collectively we are working on… It is built in such a way as to make it inherently trustworthy in four ways: it’s open source, so I can see the code that controls the project; it is fantastically clearly written and clearly architected, so reading the code is actually easy – it’s well commented, I’m able to read it; it respects existing boundaries on the web – it does not read https (so my email is fine) and respects incognito mode; and I know how to turn it off – also very important.

In contrast to that I want to talk about Editor. This is a text editor like any other… except that whatever you type is sent to a series of micro services which look for similarity against the NYT keyword corpus, and then send the results back to the editor – enabling a tight mark-up of the text. The issue is that the writer is used to writing alone, then sending to production. Here we are asking the writer to share their work in progress, sending it to central AI services at the NYT, so making that trustworthy is a huge challenge, and we need to work out how best to do this.
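
To make that concrete, here’s a tiny sketch of the idea as I understand it – my own invented stand-in, not the NYT’s actual services – where typed text is checked against a keyword corpus and matching spans come back as tags:

```python
# A minimal, hypothetical stand-in for the tagging micro-services behind
# Editor: the corpus, tags and matching logic are invented for illustration.

KEYWORD_CORPUS = {
    "de blasio": "PERSON",
    "brooklyn": "LOCATION",
    "city council": "ORGANIZATION",
}

def annotate(text):
    """Return (term, tag, offset) for each corpus term found in the text."""
    lower = text.lower()
    hits = []
    for term, tag in KEYWORD_CORPUS.items():
        start = lower.find(term)
        if start != -1:
            hits.append((term, tag, start))
    return sorted(hits, key=lambda hit: hit[2])

print(annotate("De Blasio spoke to the City Council in Brooklyn."))
# [('de blasio', 'PERSON', 0), ('city council', 'ORGANIZATION', 23),
#  ('brooklyn', 'LOCATION', 39)]
```

The real system presumably does much richer matching; the point is just the round trip from draft text to machine-suggested mark-up.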

Legibility

Data scientists have a tendency towards the complex. I’m no different – show me a new tool and I’ll want to play with it; I enjoy a new toy. And we love complex algorithms, especially if we spent years learning about them in grad school. And those can render any data illegible.

So we have [NAME?], an infinite scrolling browser – when you scroll you can continue on. And at the end of each article an algorithm offers three different recommendation strands… It’s like a choose-your-own-adventure experience. So we have three recommended articles, based on very simple recommendation engines, which renders them legible. These are the “style graph” – things that are similar in style; the “collaborative filter” – readers like you also read; and the “topic graph” – similar in topic. These are all based on the nodes and edges of the connections between articles. They are simple, legible concepts, and easy to run, so we can use them across the whole NYT corpus. They are understandable, so have a much better chance of resonating with our colleagues.
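
The legibility claim is easy to see in miniature. Here’s a toy “topic graph” recommender – invented article IDs and topics, not NYT data – where recommendations are simply a node’s neighbours:

```python
from collections import defaultdict
from itertools import combinations

# Articles as nodes, labelled with topics (all invented for illustration).
articles = {
    "a1": {"politics", "europe"},
    "a2": {"politics", "economy"},
    "a3": {"europe", "travel"},
}

# An edge between any two articles that share at least one topic.
graph = defaultdict(set)
for x, y in combinations(articles, 2):
    if articles[x] & articles[y]:
        graph[x].add(y)
        graph[y].add(x)

def recommend(article_id, k=3):
    """Neighbours in the topic graph: a deliberately legible recommender."""
    return sorted(graph[article_id])[:k]

print(recommend("a1"))  # ['a2', 'a3']
```

Nothing clever is hidden here, which is exactly the point: anyone in the building can reason about why an article was recommended.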

As a counter point, we were tasked with looking at behavioural segmentation – to see how we can build different products for different readers. Typically segmentation is done with demography. We were interested, instead, in using just the data we had: the behavioural data. We arranged all of our pageviews into sessions (arriving at a page through to leaving the site). For each session we represented the data as a transition matrix, capturing the probability of moving from one page to the next… So we can perform clustering of behaviours… Looking at this we can see some clusters that we already knew about… We have the “one and dones” – read one article then move on. We found the “homepage watchers” – who sit on the homepage and use it as a launching point. The rest the NYT didn’t have names for… So we now have the “homepage bouncer” – going back and forth from the front page – and the “section page starter” as well, for instance.

This is a simple k-means model and clustering – very simple, but dynamic and effective. However, this is very, very radical at the NYT amongst non data scientists. It’s hard to make it resonate, to drive any behaviour or design in the building. We have a lot of work to do to make this legible and meaningful for our colleagues.
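
For the curious, here’s a minimal sketch of that pipeline – page types, sessions and parameters all invented, not NYT code – turning each session into a transition matrix and clustering the flattened matrices with k-means:

```python
import numpy as np
from sklearn.cluster import KMeans

PAGES = ["homepage", "article", "section"]  # simplified page types
IDX = {p: i for i, p in enumerate(PAGES)}

def transition_matrix(session):
    """Row-normalised counts of moves between page types in one session."""
    m = np.zeros((len(PAGES), len(PAGES)))
    for a, b in zip(session, session[1:]):
        m[IDX[a], IDX[b]] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)

sessions = [
    ["homepage", "article"],                         # "one and done"-ish
    ["homepage", "article", "homepage", "article"],  # "homepage bouncer"
    ["section", "article", "article"],               # "section page starter"
]

# Flatten each matrix into a feature vector, then cluster.
X = np.array([transition_matrix(s).ravel() for s in sessions])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # a cluster label per session
```

The model is almost embarrassingly simple, which is rather the speaker’s point: the hard part is making the output legible to colleagues, not the maths.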

The final section I want to talk about is Live…

Live

In news we have to be live; we have to work on timescales of seconds to a minute. In the lab that has been expressed as streams of data – never ending sequences of data arriving at our machines as quickly as possible.

So, one of our projects, Delta, produces a live visualisation of every single page view of the NYT – a pixel for each person, starting on the globe, then pushing outwards. If you’ve visited the NYT in the last year or so, you’ve generated a pixel on the globe in the lab. We use this to visualise the work of the lab. We think the fact that this is live is very visceral. We always start with the globe… But then we show a second view, using the same pixels in the context of sections, of the structure of the NYT content itself. And that can be explored with an Xbox controller. Being live makes it relevant and timely, to understand current interests and content. It ties people to the audience, and encourages other parts of the NYT to build some of these live experiences… But one of the tricky things is that it is hard to use live streams of data, hence…

Streamtools, a tool for managing live streams of data. It should be reminiscent of Simulink or LabVIEW etc. [when chatting to Mike earlier I suggested it was a super-pimped, realtime Yahoo Pipes and he seemed to agree with that description too]. It’s now on its third incarnation and you can come and explore a demo throughout today.
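
For anyone who hasn’t met the dataflow model, here’s a toy generator pipeline – entirely my own sketch to give a flavour of the blocks-wired-together idea, not Streamtools code:

```python
import random
import time

def source():
    """Emit a fake page-view event roughly every 0.1s, forever."""
    while True:
        yield {"section": random.choice(["news", "sport", "arts"])}
        time.sleep(0.1)

def filter_block(events, section):
    """Pass through only events for one section."""
    return (e for e in events if e["section"] == section)

def count_block(events, every=5):
    """Emit a running count every `every` matching events."""
    n = 0
    for _ in events:
        n += 1
        if n % every == 0:
            yield n

# Wire the blocks together and run briefly.
for total in count_block(filter_block(source(), "news")):
    print("news events so far:", total)
    if total >= 10:
        break
```

Streamtools presents this kind of pipeline as draggable visual blocks over live data, which is what invites the Simulink comparison.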

Now, I’ve been a data scientist for a while, and when we bring our systems to the table we need to be aware that what we build embodies our own values. And I think that for data science in media we should be building trustworthy systems, tools which are legible to others, and those that are live.

Find out more at nytlabs.com.

Q&A

Q1) I wanted to ask about expectations. In a new field it can be hard to manage expectations. What are your users’ expectations for your group and how do you manage that?

A1) In R&D we have one data scientist and a bunch of designers. We make speculative futures, build prototypes, and bring them to the NYT, to the present, to help them make decisions about the future. In terms of data science in general at the NYT… Sometimes things look magic and lovely but we don’t understand how they work; in other places it’s much simpler, e.g. counting algorithms. But there’s no risk of a data science winter – we’re being encouraged to do more.

Q2) NYT is a paper of record, how do you manage risk?

A2) Our work is informed by a very well worded privacy statement that we respect and build our work on. But the other areas of ethics etc. are still to be looked at.

Q3) Much of what you are doing is very interactive and much of data science is about processing large sets of data… So can you give any tips, for someone working with terabytes of data, for working with designers?

A3) I think a data scientist essentially is creating a palette of colours for your designer to work with. And forcing you to explain that to the designer is useful, and enables those colours to be used. And we encourage the idea that there isn’t just one solution; we need to try many. That can be painful as a data scientist as some of your algorithms won’t get used, but it gives some great space to experiment and find new solutions.

Data Journalism Panel Session moderated by Frank O’Donnell, Managing Editor of The Scotsman, Edinburgh Evening News and Scotland on Sunday

We’re going to start with some ideas of what data journalism is…

Crina Boros, Data Journalist, Greenpeace

I am a precision journalist, and I have just joined Greenpeace having worked at Thomson Reuters, BBC Newsnight etc. And I am not a data scientist, or a journalist: I am a precision journalist working with data. At Greenpeace data is being used for investigative journalism purposes, in areas no longer or rarely picked up by mainstream media, to find conflicts of interest, and to establish facts and figures for use in journalism and in campaigning. And it is a way to protect human sources and enable journalists in their work. I have, in my role, both used data that exists and created data when it does not exist. And I’ve sometimes worked with data that was never supposed to see the light of day.

Evan Hensleigh, Visual Data Journalist, The Economist

I was originally a designer and therefore came into information visualisation and data journalism by a fairly convoluted route. At the Economist we’ve been running since the 1840s and we like to say that we’ve been doing data journalism since we started. We were founded at the time of the Corn Laws, in opposition to those laws, and visualised their impact as part of that.

The way we now tend to use data is to illustrate a story we are already working on. For instance working on articles on migration in Europe, and looking at fortifications and border walls that have been built over the last 20 to 30 years lets you see the trends over time – really bringing to life the bigger story. It’s one thing to report current changes, but to see that in context is powerful.

Another way that we use data is to investigate changes – a colleague was looking at changes in ridership on the Tube, and the rise of the rush hour – and then use that to trigger new articles.

Rachel Schutt, Chief Data Scientist, Newscorp

I am not a journalist but I am the Chief Data Scientist at Newscorp, and I’m based in New York. My background is a PhD in statistics, and I used to work at Google in R&D and algorithms. And I became fascinated by data science so started teaching an introductory course at Columbia, and wrote a book on this topic. And what I now do at Newscorp is to use data as a strategic asset. So that’s about using data to generate value – around subscriptions, advertising etc. But we also have data journalism so I increasingly create opportunities for data scientists, engineers, journalists, and in many cases a designer so that they can build stories with data at the core.

We have both data scientists and data engineers – hybrid skills around engineering, statistical analysis, etc.; sometimes an individual’s skills cross those borders, sometimes it’s different people. And we also have those working more in design and data visualisation. So, for instance, we now get data dumps – the Clinton emails, transcripts from Ferguson etc. – and we know those are coming, so can build tools to explore them.

A quote I like is that data scientists should think like journalists (from DJ Patil) – in any industry. At Newscorp we also get to learn from journalists, which is very exciting. But the idea is that you have to be investigative, be able to tell a story, to…

Emily Bell says “all algorithms are editorial” – because value judgements are embedded in those algorithms, and you need to understand the initial decisions that go with that.

Jacqui Maher, Interactive Journalist, BBC News Labs
I was previously at the NYT, mainly at the Interactive News desk in the newsroom. An area crossing news, visualisation, data etc. – so much of what has already been said. And I would absolutely agree with Rachel about the big data dumps and looking for the story – the last dump of emails I had to work with were from Sarah Palin for instance.

At the BBC my work lately has been on a concept called “Structured Journalism” – when we report on a story we put together all these different entities in a very unstructured set of data: audio, video etc. Many data scientists will try to extract that structure back out of the corpus… So we are looking at how we might retain the structure that is in a journalist’s head as they are writing the story. So, digital tools that will help journalists during the investigative process, and ways to retain connections, structures etc. And then, what can we do with that… What can make it more relevant to readers/viewers – context pieces, ways of adding context in a video (a tough challenge).

If you look at work going on elsewhere, for instance the Washington Post’s work on IS, they are looking at how to similarly add context, how they can leverage previous reporting without having to do it from scratch.

Q&A/Discussion

Q1 – FOD) At a time when we have to cut staff in media, in newspapers in particular, how do we justify investing in data science, or how do we use data science?

A1 – EH) Many of the people I know came out of design backgrounds. You can get pretty far just using available tools. There are a lot of useful tools out there that can help your work.

A1 – CB) I think this stuff is just journalism, and these are just another set of tools. But there is a misunderstanding that you don’t press a button and get a story. You have to understand that it takes time – there’s a reason it is called precision journalism. And sometimes the issue is that the data is just not available.

A1 – RS) Part of the challenge is about traditional academic training and what is and isn’t included there… But there are more academic programmes on data journalism now. It’s a skillset issue. I’m not sure, on a pay basis, whether data journalists should get paid more than other journalists…

A1 – FOD) I have to say in many newsrooms journalists are not that numerate. Give them statistics, even percentages, and that can be a challenge. It’s almost a badge of honour as wordsmiths…

A1 – JM) I think most newsrooms have an issue of silos. You also touched on the whole “math is hard” thing. But to do data journalism you don’t need to be a data scientist – you don’t have to be an expert on maths, stats, visualisation etc. At my former employer I worked with Mike – who you’ve already heard from – who enabled me to cross that barrier. I didn’t need to understand the algorithms, but I had that support. You do see more journalist/designer/data scientists working together. I think eventually we’ll see all of those people as journalists, though, as you are just trying to tell the story using the available tools.

Q2) I wanted to ask about the ethics of data journalism. Do you think there is a developing field of ethics in data journalism?

A2 – JM) I think that’s a really good question in journalism… But I don’t think it’s specific to data journalism. When I was working at the NYT we were working on the Wikileaks data dumps, and there were huge ethical issues around the information included there, in terms of names, in terms of risk. And in the end, whatever method you might take – such as blocking part of a document out – the technology might vary but the ethical issues are the same.

Q2 follow up FOD) And how were those ethical issues worked out?

A2 – JM) Having a good editor is also essential.

A2 – CB) When I was at Thomson Reuters I was involved in running women’s rights surveys to collate data, and when you do that you need to apply research ethics, with advice from those appropriately positioned to give it.

A2 – RS) There is an issue that traditionally journalists are trained in ethics but data scientists are not. We have policies in terms of data privacy… But there is much more to do. And it comes down to the person who is building a data model – you have to be aware of the possible impact and implications of that model. And risks also of things like the Filter Bubble (Pariser 2011).

Q3 – JO) One thing that came through listening to ? and Jacqui: it’s become clear that data is a core part of journalism… You can’t get the story without the data. So, is there a competitive advantage to being able to extract that meaning from the data – is there a data science arms race here?

A3 – RS) I certainly look out to the NYT and other papers and admire what they do, but of course the reality is messier than the final product… But there is some of this…

A3 – JM) I think that if you don’t engage with data then you aren’t keeping up with the field – you are doing yourself a professional disservice.

A3 – EH) There is a need to keep up. We are a relatively large group, but nothing like the scale of the NYT… So we need to find ways to tell stories that they won’t tell, or to have a real sense of what an Economist data story looks like. Our team is about 12 or 14; that’s a pretty good size.

A3 – RS) Across all of our businesses there are 100s in data science roles, of whom only a dozen or so are on the data journalism side.

A3 – JM) At the BBC there are about 40 or 50 people on the visual journalism team. But there are many more in data science in other roles, people at the World Service. But we have maybe a dozen people in the lab at any given moment.

Q4) I was struck by the comment about legibility, and, a little bit related, transparency in data. Data is already telling a story; there is an editorial dimension, and that is added to in the presentation of the data… And I wonder how you can do that to improve transparency.

A4 – JM) There are many ways to do that… To show your process, to share your data (if appropriate). Many share code on GitHub. And there is a question there though – if someone finds something in the data set, what’s the feedback loop?

A4 – CB) In the past where I’ve worked we’ve shared a document on the step by step process used. I’m not a fan of sharing on GitHub, I think you need to hand hold the reader through the data story etc.

Q5) Given that journalism is about holding companies to account… In a world where, e.g., Google are the new power brokers, who will hold them to account? I think data journalism needs a merger of journalism, data science, and design… Sometimes that can be in one person… And what do you think about journalism playing a role in holding new power brokers to account?

A5 – EH) There is a lot of potential. These companies publish a lot of data and/or make their data available. There was some great work on FiveThirtyEight about Uber, based on a Freedom of Information request, essentially fact-checking Uber’s own statistics and reporting of activities.

Q6) Over the years we’ve (Robert Gordon University) worked with journalists from various organisations. I’ve noticed that there is an issue, not yet raised, that journalists are always looking for a particular angle in data as they work with it… It can be hard to get an understanding from the data, rather than using the data to reinforce bias etc.

A6 – RS) If there is an issue of taking a data dump from e.g. Twitter to find a story… Well, dealing with that bias does come back to training. But yes, there is a risk of journalists getting excited, wanting to tell a novel story, without being checked by colleagues or correcting the analysis.

A6 – CB) I’ve certainly had colleagues wanting data to substantiate the story, but it should be the other way around…

Q6) If you, for example, take the Scottish Referendum and the General Election, you see journalists so used to watching their dashboards and getting real-time feedback that they use them for the stories rather than doing any real statistical analysis.

A6 – CB) That’s part of the reason for reading different papers and different reporters covering a topic – and you are expected to have an angle as a journalist.

A6 – EH) There’s nothing wrong with an angle or a hunch but you also need to use the expertise of colleagues and experts to check your own work and biases.

A6 – RS) There is a lot more to understanding how the data has come about; people often use the data set as a ground truth, and that needs more thinking about. It’s somewhat taught in schools, but not enough.

A6 – JM) That makes me think of a data set called GDELT, which captures media reporting and enables event detection etc. I’ve seen stories of a journalist looking at that data as a canonical source for all that has happened – and that’s a misunderstanding of how that data set has been collected. It’s close to a canonical source for reporting, but that is different. So you certainly need to understand how the data has come about.

Comment – FOD) So, you are saying that we can think we are in the business of reporting fact rather than opinion but it isn’t that simple at all.

Q7) We have data science, is there scope for story science? A science and engineering of generating stories…

A7 – CB) I think we need a teamwork sort of approach to storytelling… With coders, with analysts looking for the story… The reporters doing field reporting, and the data vis people making it all attractive and sexy. That’s an ideal scenario…

A7 – RS) There are companies doing automatic story generation already – like Narrative Science etc. – e.g. on Little League matches…

Q7 – comment) Is that good?

A7 – RS) Not necessarily… But it is happening…

A7 – JM) Maybe not, but it enables story telling at scale, and maybe that has some usefulness really.

Q8/Comment) There was a question about ethics, with the comment that nothing specific to data journalism is needed there, and there was the comment about legibility. And I think there is a conflict there. Statistical databases can infer missing data from the data you have – making valid inferences that could shock people, because they are not actually in the data (e.g. salary prediction). This reminded me of issues such as source protection, where you may not explicitly identify the source but that source could be inferred. So you need a complex understanding of statistics to understand that risk, and to do that practice appropriately.
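
The inference point is worth making concrete. A toy sketch with invented numbers – nothing from the discussion itself – showing how a model can “know” a salary that was never recorded:

```python
import numpy as np

# (years_experience, salary) for people who ARE in the database.
X = np.array([[1.0], [3.0], [5.0], [7.0]])
y = np.array([25000.0, 35000.0, 45000.0, 55000.0])

# Fit a least-squares line: salary ~ a * years + b.
A = np.hstack([X, np.ones_like(X)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict for someone who never appeared in the data at all.
print(round(a * 4.0 + b))  # 40000 - inferred, not recorded
```

The same mechanics that fill in a plausible salary can, in the wrong hands, narrow down a confidential source.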

A8 – CB) You do need to engage with the social sciences, and to properly understand what you are doing in terms of your statistical analysis, your p-values etc. There is more training taking place, but still more to do.

Q9 – FOD) I wanted to end by coming back to Howard’s introduction. How could ATI and Edinburgh help journalism?

A9 – JM) I think there are huge opportunities to help journalists make sense of large data sets, whether that is tools for reporting or analysis. There is one, called Detector.io, that lets you map reporting, for instance, but it is shutting down and I don’t know why. There are some real opportunities for new tools.

A9 – RS) I think there are areas in terms of curriculum: design, ethics, privacy, bias… Softer areas not always emphasised in conventional academic programmes, but at least as important as the scientific and engineering sides.

A9 – EH) I think generating data from areas where we don’t have it. At the Economist we look at China, Asia, Africa, where data is either deliberately obscured or there isn’t the infrastructure to collect it. So tools to generate that would be brilliant.

A9 – CB) Understand what you are doing; push for data being available; and ask us and push us to be accountable, and it will open up…

Q10) What about the readers? You’ve been saying the journalists have to understand their stats… But what about the readers, who know the difference between reading the Daily Mail and the Independent, say, but don’t have the data literacy to understand the data visualisation etc.?

A10 – JM) It’s a data literacy problem in general…

A10 – EH) Data scientists have the skills to find the information and raise awareness.

A10 – CB) I do see more analytical reporting in the US than in Europe. But data isn’t there to obscure anything – you have to explain what you have done in clear language.

Comment – FOD) It was once the case that data was scarce, and reporting was very much on the ground and on foot. But we are no longer hunter gatherers in the same way… Data is abundant and we have to know how to understand, process, and find the stories in that data. We don’t have clear ethical codes yet. And we need a better understanding of what is being produced. And most of the media most people consume is local media – city and regional papers – and they can’t yet afford to get into data journalism in a big way. Relevance is a really important quality. So my personal challenge to the ATI is: how do we make data journalism pay?

And we are back from lunch and some excellent demos… 

Ericsson, Broadcast & Media Services – Keynote from Steve Plunkett, CTO

Jon Oberlander is introducing Steve Plunkett who has a rich history of work in the media. 

I’m going to talk about data and audience research, and trends in audience data – how we collect, aggregate and analyse lots of data, and where many of the opportunities are…

Ericsson has 24,000 people in R&D, very much focused on telecoms. But within R&D there is a group for broadcast and media services, and I joined as part of a buy out of Red Bee Media. One part of these services is a metadata team who create synopses for EPGs across Europe (2,700 channels). We are also the biggest subtitlers in Europe. And we also do media management – with many hundreds of thousands of hours of audio and TV, and that’s also an asset we can analyse (the inventory as well as the programme). And we operate TV channels – all the BBC, C4, C5 and UKTV channels, in France, the Netherlands, and in the US – and our scheduling work is also a source of data. And we also run recommendation engines embedded in TV guides and systems.

Now, before I talk about the trends I want to talk about the audience. Part of the challenge is understanding who the audience is… And audiences change, and the rate of change is accelerating. So I’ll show some trends in self-reported data from audiences on what they are watching. Before that, a quote from Reed Hastings, Netflix: “TV had a great 50 year run, but now it’s time is over”. TV is still where most impact and viewing hours are, but there are real changes now.

So, the Ericsson ConsumerLab Annual Report – participants across the world, 1000 consumers across 20 countries. In-home interviews to understand their viewing context, what they are watching and what their preferences are. Of course self-reported behaviour isn’t the same as real data, but we can compare and understand that.

So, the role of services varies between generations. The go-to services are very different between older generations and younger generations. For older viewers it’s linear TV, then DVR, then play/catch-up, then YouTube etc. For younger generations SVOD is the top viewing service – things like Netflix, Amazon Prime etc.

In terms of daily media habits we again see a real difference between use of scheduled linear TV vs. streamed and recorded TV. Younger people are again much more likely to use streaming, older viewers scheduled TV much more. And we are seeing YouTube growing in importance – viewing of over 3 hrs per day has increased hugely in the last 4 years, and it is used as a go-to space to learn new things (e.g. how to fix the dishwasher).

In terms of news, the importance of broadcast news increases with age – it is still much more important to older consumers. And programming wise, 45% of streamed on-demand viewing of long content is TV series. Many watch box sets, for instance. As broadcasters we have to respect that pattern of use; not all are linear scheduled viewers. And you see this in trends of tweeting – peaks of tweets showing how quickly a newly released online series has been completed.

There is also a shift from fixed to mobile devices. TV screens and desktop PCs have seen a reduction in viewing hours and use compared to mobile, tablet and laptop use. That’s a trend over time, and it again follows generational lines… Younger people are more likely to use mobile. Now again, this is self-reported and can vary between countries. So in our broadcast planning, understanding content – length of content, degree of investment in high definition etc. – should be informed by those changes. On mobile, user generated content – including YouTube but also things like Periscope – is still dominant.

In terms of discovering and remembering content it is still the case that friends, reviews, trailers etc. matter. But recommendation engines are important and viewers are broadly satisfied with them. For the last two years we’ve asked our study group about those recommendation engines: their accuracy; their uncanniness, and data and privacy concerns; and the issue of shared devices. So there is still much more to be done. The scale of Netflix’s library is such that recommendations are essential to help users navigate.

So, that was self-reported. What about data we create and collect?

We have subtitle coverage, often doing the blanket subtitle coverage for broadcasters. We used to use transcribers and transcription machines; we then invested in respeaking technologies, and that’s what we use now – respeakers clean up grammar etc., and the technology is trained for their voice. The process of logging subtitles includes very specific timestamps… That gives us rich new data, and also creates a transcript that can sit alongside the subtitles and programme. But it can take 6-7 hours to do subtitling as a whole process, including colour coding speakers etc. And we are looking to see what else subtitlers could add – mood perhaps? – as part of this process.

We have a database of about 8.5 million records that include our programme summaries, images on an episode level, etc. And we are working on the system we use to manage this, to improve it.

I mentioned Media Management, and we do things like automated transcription – it wouldn’t be good enough for use in broadcast but…

Mediaroom – 60 telecom operators use it for IPTV, and it collects very granular data from TV viewing – all collected with consent. Similar for OTT. And similar platforms for EPG. Search queries. Recommendations and whether they were acted upon. And we also have mobile network data – to understand drop off rates, what’s viewed for a particular item etc.

We are in the middle, between the broadcaster and the audience, so our work feeds into broadcasters’ work. For insights like segmentation, commissioning, marketing, scheduling, sales… For personalisation – content recommendations, personalised channels that are unique to you, targeted advertising, search, content navigation, contextual awareness. One of the worst feedback comments we see is about delivery quality, so we also apply our data to network optimisation etc.

In terms of the challenges we face, they include: consumer choice; data volumes – growing fast, so finding value matters; data diversity – very different in structure and form, so a complex task; expertise – there is a lack of skills embedded in these businesses to understand our data; timeliness – personal channels need fast decisions etc., and real-time processing is a challenge; and privacy – one of the biggest ones here; the industry needs to know how to do this, and our feedback on recommendation engines is such that we need to explain where data is coming from, to make it trusted.

In terms of opportunities: we are seeing evolving technology; cloud resources are changing this fast; investment – huge in this area at the moment; consumer appetite for this stuff; and we are in an innovation white space right now – we are in early days…

And finally… an experimental application. We took Made in Chelsea and added a graph to the viewing experience that shows tweets and peaks… and provides a navigation system based on tweets shared. And on the right-hand side, navigation by character, to follow their journey. We created some semantic visualisation tools for e.g. happy, sad, funny moments. Navigation that focuses on the viewer’s interests.

Audience Engagement Panel Session – Jon Oberlander (Moderator), University of Edinburgh

Jon is introducing his own interests in data science, in design informatics, and in linguistics and data science, with a particular mention for LitLong; similarly, a colleague in Politics is analysing public interest in the UK and EU, and also reactions to political messages. And finally the Harmonium project at the Edinburgh International Festival – using music and data on musical performers to create a new music and visualisation project, with a 20k in-person audience and researchers monitoring and researching that audience on the night too…

Pedro Cosa – Data Insights and Analytics Lead, Channel 4

I’m here to talk a bit about the story of Channel 4 and data. Channel 4 is a real pioneer in using data in the UK, and in Europe. You’ve all heard Steve’s presentation on changing trends – and these are very relevant for Channel 4, as we are a public service broadcaster, but also because our audience is particularly young and affluent. They are changing their habits quickly, and that matters from an audience and also an advertising perspective for us. Senior management was really pushing for change in the channel. Our CEO has said publicly that data is the new oil of the TV industry, and he has invested in data insights at Channel 4. The challenge is to capture as much data as possible, and feed that back into the business. So we use registration data from All4 (was 4OD) – to use that site you have to register. We have 13 million people registered that way, so that’s already capturing details on half our target audience in the UK. And that moves us from one-to-many to one-to-one. And we can use that for targeted advertising – which comes with a premium for advertisers – and to really personalise the experience. So that’s what we are doing at the moment.

Hew Bruce-Gardyne – Chief Technology Officer, TV Squared

We are a small company working on data analytics for use by advertisers, which in turn feeds back into content. My personal background is as an engineer; the big data side of number crunching is where I come from. From where I am sitting, audience engagement is a really interesting problem… A really big engaging programme seems to kill the advertising, so replays, catch-up, and the opportunities there are, for us, gold dust.

Paul Gilooly – Director of Emerging Products, MTG (Modern Times Group)

MTG is a Scandinavian pan-European broadcaster; we have the main sports and Hollywood rights as well as major free-to-air channels in Scandinavian countries. And we run Viaplay, an SVOD service like (and predating) Netflix. The Nordics are interesting as we have high speed internet, affluent viewers, and markets where Apple TV is significant, disproportionately compared to the rest of Europe. So when I think of TV I think of a subscribing audience, and Pay TV. And my concern is churn – a more engaged customer is more likely to stick around. So any way to increase engagement is of interest, and data is a key part of that. Just as Channel 4 is looking at authentication as a data starting point, so are we. And we also want to encourage behaviours like recommendations of products and sharing. And there are some behaviours to discourage – data is also the tool to help you understand those.

For us we want to increase transactions with viewers, to think more like a merchandiser, to improve personalisation… So back to the role of data – it is a way to gain competitive advantage, and it can drive business models for different types of consumer. It’s a way to understand user experience, the quality of user experience, and the building of personalised experiences. And the big challenge for me is that in the Nordics we compete with Netflix, and with HBO (which has a direct offering there). But we are also competing with Microsoft, Google, etc. We are up against a whole new range of competitors who really understand data, and what you can do with data.

Steve Plunkett – CTO, Broadcast & Media Services, Ericsson

No intro… as we’ve just heard from you… 

Q&A

Q1 – JO) Why are recommendations in this sector so poor compared to e.g. Amazon?

A1 – SP) The problem is different. Amazon has this huge inventory, and collective recommendation works well. Our content is very different. We have large content libraries, and collective recommendation works differently. We used to have human curators programming content, and they introduced serendipity; recommendation engines are less good at that. We’ve just embarked on a 12 month project with three broadcasters to look at this. There is loads of research on public top 10s. One of the big issues is that if you get a bad recommendation it’s hard to say “I don’t like this” or “not now”; they just sit there and the feedback is poor… So that is important to solve. Netflix invested a great deal of money in recommendations – $1 million for a recommender that would beat their own by 10%, and that took a long time. Data science is aligned with that, of course.

A1 – PC) Recommendations are core for us too. But TV recommendations are so much more complex than retail… You need to look at the data, analyse it… You have to promote cleverly, to encourage discovery, to find new topics or areas of debate, things you want to surface in a relevant way. It’s an area C4 and also the BBC are looking to develop.

A1 – HBG) There is a real difference between retail and broadcast – in what you do, but also in the range of content available… So even if you take a recommendation, it may not reflect true interest and buy-in to a product. That adds a layer of complexity and cloudiness…

A1 – SP) Tracking recommendations in a multi-device, multi-platform space is a real challenge… It is often a one-way exchange; closing the loop between recommendation and action is hard…

Q2 – JO) Of course you could ask active questions… Or you could be mining other streams… How noisy is that, how useful is that? Does it bridge the gap?

A2 – SP) TV has really taken off on Twitter, but there is disproportionate noise based on a particular audience and demographic. That’s a useful tool though… You can track engagement with a show, at a point of time within a show… But not necessarily the recommendations of that viewer at that time… It’s one of many data sets to use…

Q3 – JO) Are users engaging with your systems aware of how you use their data, are they comfortable with it?

A3 – PC) At C4 we have made a clear D-Word promise – with a great video from Alan Carr that explains that data use. You can understand how your data is used, can delete your own data, can change your settings, and if you don’t use the platform for 2 years then we delete your data. A very clear way to tell the user that they are in control.

A3 – SP) We had a comment from someone in a study group who said they had been categorised by a big platform as a fan of 1980s supernatural horror and didn’t want to be categorised in that way, or for others to see this. So there is a real interest in transparency there.

A3 – PG) We aren’t as far ahead as Channel 4, they are leading the way on data and data privacy.

Q4 – JO) Who is leading the way here?

A4 – PG) I think David Abraham (C4) deserves great credit here – a CEO who understands the importance of data science and its role in the core business model, and that the competitors for revenue are Facebook, Google and so forth.

Q5 – JO) So, the trend is to video on demand… Is it also people watching more?

A5 – SP) It has increased, but it is much more fragmented across broadcast, SVOD, UGC etc., and every type of media has to define its space. So YouTube etc. is eating into scheduled programming. For my 9 year old child, streaming video, YouTube etc. is her television. We are competing with a different set of producers.

A5 – PG) It isn’t that linear channels don’t allow you to collect data – if you have to log in to access content (i.e. Pay TV) then you can track all of that sort of data. So DR1, the Danish TV channel and producer of The Killing etc., is recording a huge drop in linear viewing by young people, but linear still has a role for live events, sport etc.

A5 – HBG) We do see trends that are changing… Bingeathons are happening and that indicates not a shortness of attention but a genuine change. Watching a full box set is the very best audience engagement. But if you are at a kitchen table, on a device, that’s not what you’ll be watching… It will be short videos, YouTube etc.

To come back to the privacy piece: I was at a conference talking about the push to ID cards and the large move to restrict what people can know about us… We may lose some of the benefits of what can be done. And in some data – e.g. medical informatics – there is real value that can be extracted. We know that Google knows all about us… But if our TV knows all about us, that’s somehow culturally different.

Q6) Piracy is very high, especially at younger age ranges, so what analysis have you done on that?

A6) Not a huge amount on that, and this is self-reported. But we know piracy drops where catch-up and longer catch-up windows are available – it seems that if content can be viewed legitimately, it is.

Q6 – follow up) Piracy seems essentially like product failure – how do you win back your viewers and consumers?

A6 – HBG) A while back I saw a YouTube clip of the user experience of a pirated film versus a DVD… In that case the pirated film was easier, versus the trailers, reminders not to pirate etc. on the DVD. That’s your product problem. As we move to subscription channels etc., when you make it easy, that’s a lot better. If you try to put barriers up, people try to find a way around them…

A6 – PG) Sweden has a large piracy issue. The way you compete is to deliver a great product and user experience, and couple that with content unique to your channel – premium sports for example – so the pirate can’t meet all the needs of the consumer. But also be realistic with price points.

A6 – HBG) There is a subtle difference depending on what you consume – e.g. film versus TV. But from music we know that pirating in the music industry is not the threat it appears – those pirates are also purchasing consumers. And when content creators work with that, and allow some of it to happen, that creates engagement that helps. The most successful brand owners let others play with their brand.

A6 – PC) Piracy is an issue… But we even use piracy data sources for analysis – using BitTorrent to understand the popularity of shows in other places, to predict how popular they will be in the UK.

Comment – JO) So, pirates are data producers?

A6 – PC) Yes, and for scheduling too.

Q7) How are you dealing with cross-channel or cross-platform data – working with Google or Amazon, say? I don’t see much of that with linear TV, maybe a bit with SVOD. How are mainstream broadcasters approaching that?

A7 – PC) Cross platform can mean different things. It may be Video On Demand as well as broadcast on the TV. We can’t assume those are different people, and should look to understand what the connections are… We are very conscious and cautious about using third party data… But we can do some content matching – e.g. against an advertiser’s customer base – and be much more personalised. A real link between publisher and advertiser.

Q7 follow up) Would the customer know that is taking place?

A7 – PC) It is an option at sign up. Many say “yes” to that question.

A7 – PG) We still have a lot to do to track the consumer across platforms, so a viewer can pick up consuming content from one platform on another. This technology is pretty immature – an issue with recommendation engines too.

A7 – SP) We do have relationships with third party data companies that augment what we collect – different from what a broadcaster would do. For this it tends to be non-identifiable… But you have to trust the analyst to have combined data appropriately. You have to understand their method and process, but usually they have to infer from data anyway, as they usually don’t have the source.

Q8 – JO) We were talking about unreliable technologies and opportunities… So, where do you see wearable technologies fitting in?

A8 – SP) We did some work using facial recognition to understand the usefulness of recommendations. That was interesting but deploying that comes with a lot of privacy issues. And devices etc. also would raise those issues.

A8 – PC) We aren’t looking at that sort of data… But data like weather matters for this industry – local events, traffic information – as context for consumption etc. That is all being considered as context for analysis. But we also share our data science with creative colleagues. Technology will tell you when content is performed/shown, but there is a subjective human aspect that they want to see – to dissect elements of content so the machine can really learn… So is there sex involved… Who is the director, who is the actress… So many things you can put into the system to find this stuff out. Forecasting really is important in this industry.

A8 – HBG) The human element is interesting. Serendipity is interesting. From a neuroscientist’s point of view I always worry about the act of measurement… We see all the time that the same audience, same demographic, watching the same content reacts totally differently at different times of day etc. And live vs catch-up, say. My fear, and a great challenge, is how to make a neuroscience experiment valid in that context.

Q9 – from me) What happens if the data is not there in terms of content, or recommendation engines – if the data you have tells you there is a need for something you don’t currently have available? Are you using data science to inform production or content creation, or for advertising?

A9 – SP) The research we are currently doing is looking at ways to get much better data from viewers – trying things like a Tinder-like playful interface to get a better understanding of what users want. But whenever there are searches we also capture not only what is available on that platform but also what is in demand and not yet available, and provide details of those searches to commissioning teams to inform what they do.

A9 – PG) There are some interesting questions about what is most valuable… So you see Amazon Prime deciding on the value of Jeremy Clarkson and the Top Gear team… And I think you will increasingly see purchasing based on data. And when it comes to commissioning, we are looking to understand gaps in our portfolio.

A9 – PC) We are definitely interested in that. VOD is a proactive thing… You choose as a viewer… So we have an idea of micro genres that are specific to you… We have, say, “sex/pervert corner”; we have teenage American comedy; etc. And you can see how micro genres are panning out… And you can then tell commissioners what is happening on the video on demand side… But that’s different to commissioning for TV, and convincing that…

A9 – HBG) I think that you’ve asked the single greatest question at a data science conference: what do you do if the data is not there? And sometimes you have to take a big leap to do something you can’t predict… And that happens when you have to go beyond the possibilities of the data, and just get out there and do it.

A9 – SP) The concern is that the data may start to reduce those leaps and big risks, and that could be a worry.

JO) And that’s a great point to finish on: that no matter how good the data science, we have to look beyond the data.

And after a break we are back… 

BBC – Keynote from Michael Satterthwaite, Senior Product Manager

I am the senior product manager on a project called BBC Rewind. We have three projects looking at opportunities, especially around speech-to-text, from BBC Monitoring, BBC Rewind, and BBC News Labs. BBC Rewind is about maximising value from the BBC archive. But what does “value” mean? Well, it can be about money, but I’m much more interested in the other options around value… Can we tell stories, can we use our content to improve people’s health… These are high level aims, but we are working with the NHS and dementia organisations, and running a hack event in Glasgow later this month with the NHS, Dementia UK, Dementia Scotland etc. We are wondering if there is any way we can make someone’s life better…

So, how valued is the BBC’s Archive? I’m told it’s immeasurable, but what does that mean? We have content in a range of physical locations, some managed by us, some by partners. But is it all valuable if it’s just locked away? What we’ve decided to do, to make sure we do get value, is to see how we can extract it.

So, my young niece: before she was 2 she’d worked out how to get into her mum’s iPad… And her dad works a lot in China, and has an iPhone. In an important meeting he’d got loads of alerts… It turns out she’d worked out how to take photos of the ceiling and send them to him… How does this relate? Well, my brother-in-law didn’t delete those pictures… And how many of us do delete our photos? [quick poll of the room: very very few delete/curate their digital images]

Storage has become so cheap that we have no need to delete. But at the BBC we used to record over content because of the costs of maintaining it. That reflected the high price of storage – the episodes of Doctor Who taped over to use for other things. That was a decision for an editor. But the price of storage has dropped so far that we can, in theory, keep everything, from programmes to scripts and script notes, transcripts etc. That’s hard to look through now. Traditionally the solution is humans generating metadata about the content. But as we are now cash-strapped and there is so much content… is that sustainable?

So, what about machines – and here’s my Early Learning Centre bit on machine learning… It involves a lot of pictures of pandas and a very confused room… demonstrating a Panda and Not a Panda. When I do this presentation to colleagues in production they see shiny demos of software but don’t understand what realistic expectations of the machine are. Humans are great at new things and intelligence, new problems and things like that…

Now part two of the demo… some complex maths… Computers are great at scale, at big problems. There is an Alan Turing quote here that seems pertinent, about it not being machines or humans, it’s finding ways for both to work together. And that means thinking about what machines are good at – things like initial classification, and scale. What are humans good at? Things like classifying the most emotional moment in a talk. And we also need to think about how best we can use machines to complement humans.

But we also need to think about how good is good enough. If you are doing transcripts of an hour-long programme, you want 100% or close to it, finishing with humans. But if you are finding a moment in a piece of spoken word, you just need to find the appropriate words for that search. That means your transcript might be very iffy, but that’s fine as long as it’s good enough to find those key entities. We can spend loads of time and money getting something perfect, when there is much more value in getting work to a level that’s good enough to do something useful and productive.
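
To illustrate the “good enough” point, here’s a toy sketch – the transcript segments and entities are invented, and this is my illustration rather than BBC code – showing how even a noisy automatic transcript can locate moments if the key entities survive transcription:

```python
# A noisy automatic transcript: (timestamp_seconds, recognised_text).
# Plenty of words may be wrong; we only need the entities to survive.
transcript = [
    (12.4, "the prime minister arrived at the summit"),
    (45.1, "putin shook hands with western leaders"),
    (78.9, "weather for the weekend looks unsettled"),
]

def find_moments(entities, segments):
    """Return (timestamp, text) for segments mentioning any target entity."""
    targets = [e.lower() for e in entities]
    return [(t, text) for t, text in segments
            if any(e in text.lower() for e in targets)]

print(find_moments(["Putin", "G8"], transcript))
# [(45.1, 'putin shook hands with western leaders')]
```

A 45%-accurate transcript fails badly as subtitles, but as an index into hours of audio it can already be useful.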

This brings me to BBC Rewind. The goal of this project is to maximise the value of the BBC Archives. We already have a lot of digitised content, for lots of reasons – often to do with tape formats dying out and the need to make new proxies. We are digitising more of selected parts of the BBC Archives, using a mixture of innovative human and computer approaches to enrichment, and looking at new ways to use archives in our storytelling to audiences.

One idea we’ve tried is BBC Your Story, which creates a biography based on your own life story through BBC Archive content. It has been incredibly successful as a prototype, and we are looking at how we can put it into production and make it more personalised.

We’ve also done some work on Timeline, where we wanted to try out semantic connections etc. – but we don’t have all our content marked up as we would need, so we did some hand mark-up to try the idea out. My vision is that we reach a point where we can search for:

“Vladimir Putin unhappily shaking hands with Western Leaders in the rain at the G8, whilst expressing his happiness.” 

So we could break that into many parts, each requiring complex mark-up of the content to locate suitable material.

At the moment BBC Rewind includes:

Speech-to-text in English, based on the Kaldi toolset – it’s maybe 45% accurate off the shelf, but that’s 45% more of the words than you had before, plus a confidence value for each.
Speech-to-text in the Welsh language.
Voice identification and speaker segmentation – speech recognition that identifies speakers is nice, but we don’t need that just yet. Even if we did, the speaker needn’t be named automatically: a human can tag that easily, and we can then train algorithms off it.
Face recognition – good, but hard to scale; we’ve been doing some work with Oxford University in that area.

And then we get to context… Brian Cox versus (Dr) Brian Cox can be disentangled with some basic contextual information.
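[On that last point, a minimal sketch of how basic contextual information might separate the two Brian Coxes – my own illustration with invented keyword lists, not the Rewind implementation.]

```python
# Disambiguate an ambiguous name by counting context-word overlaps.
# The keyword sets are illustrative assumptions.
CONTEXT = {
    "Brian Cox (physicist)": {"physics", "particle", "universe", "cern"},
    "Brian Cox (actor)":     {"film", "stage", "theatre", "role"},
}

def disambiguate(surrounding_words):
    """Pick the entity whose keywords best match the surrounding words."""
    words = set(surrounding_words)
    scores = {entity: len(keywords & words)
              for entity, keywords in CONTEXT.items()}
    return max(scores, key=scores.get)

print(disambiguate(["the", "universe", "and", "particle", "physics"]))
# -> Brian Cox (physicist)
```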

Finally, we have an exciting announcement. BBC Monitoring is a great example of how we can use machines to help human beings in monitoring media, so we will be creating tools to enable that monitoring. In this project the BBC is partnering with the University of Edinburgh, UCL, Deutsche Welle and others in an EU-funded Horizon 2020 project called SUMMA. The project has four workstreams and we are keen to make new partnerships.

The BBC now runs tech hack events which have already resulted in new collaborations – including SUMMA – and more hack events are coming soon, so contact Susanne Weber, Language Technology Producer in BBC News Labs. The first SUMMA hack event will be at the end of next year and will focus on the automated monitoring of multimedia sources: audio-visual, text etc.

Let’s try stuff faster and work out what works – and what doesn’t – more quickly!

Unlocking Value from Media Panel Session – Moderator: Simon King, University of Edinburgh

Our panel is…

Michael Satterthwaite – Senior Product Manager, BBC
Adam Farquhar – Head of Digital Scholarship, British Library
Gary Kazantsev – R&D Machine Learning Group, Bloomberg
Richard Callison – brightsolid (DC Thomson and Scottish Power joint initiative)

Q1 – SK) Let’s start with that question of what value might be, if not financial?

A1 – GK) Market transparency, business information – there are quantitative measures for some of these things. But it’s a very hard problem in general.

A1 – AF) We do a lot of work on value and economic impact in the UK, but we also did some work a few years back sharing digitised resources on Flickr, and that generated huge excitement and interest. That’s a great example of where you can create value by being open, rather than monetising early on.

A1 – MS) Understanding value is really interesting. Getty uses search to aid discovery, and they have learned to use the data they are capturing to ensure users get to what they want – and want to buy – quickly. For us, with limited resources, the best way to understand value and impact is to try things out a bit, to see what works and what happens.

A1 – AF) Putting stuff out there without much metadata can give you some really great crowd data. Of the million images we shared, the crowd identified the maps among those materials, and that work was followed up by georeferencing those maps onto the globe. So, even if you think there couldn’t possibly be enough of a community interested in doing this stuff, you can find that there really is that interest, and people who want to help…

A1 – MS) And you can use that to prioritise what you do next, what you digitise next, etc.

Q2 – SK) Which of the various formats of media are most difficult to work with?

A2 – MS) Images are relatively straightforward, but video is essentially 25 pictures per second… That’s a lot of content… That means sampling content, else we’d crash even Amazon with the scale of work we have. And that sampling allows you to handle time, the aspect that makes video so tricky.

Q3 – SK) Is there a big difference between archive and current data…

A3 – RC) For me the value of content is often about extracting value from very local context, and it leads back to several things said earlier, about perhaps taking a leap of faith into areas the data doesn’t show, which could be useful in the future… We’ve worked with handwritten data – the only Census that was entirely handwritten, 32m rows of records for England and Wales – and had to convert that to text… We went offshore and outsourced it to a BPO… That was a commercial project, as we knew there was historical and genealogical interest… But there aren’t many data sets like that around.

But working with the British Library we’ve done digitisation of newspapers, both from originals and from microfilm. OCR isn’t perfect, but it gets the content out there… Interest in multimedia online can be triggered by broadcast – Who Do You Think You Are? triggers huge interest in these services, and we were in the right place at the right time to make that work.

A3 – GK) We are in an interesting position, as Bloomberg creates its own data, but we also ingest more than 1 million news documents in 30 languages from 120k sources. The Bloomberg newsroom started in 1990, and they had the foresight to collect clean, clear digital data from the beginning of our work. That’s great for access, but extracting data is a different matter… There are issues like semantic mark-up and entity disambiguation… And huge issues of point-in-time correctness – named entities change meaning over time, and unless someone encoded that into the information, it is very difficult to disambiguate. And the value of this data, its role in trading etc., means it needs to be reliable.
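[A minimal sketch of what “point-in-time correctness” might look like – my example, not Bloomberg’s system. The alias table carries validity dates; the Andersen Consulting to Accenture renaming is real, but the exact ranges here are assumed for illustration.]

```python
# Resolve a surface form to an entity id *as of a date*: the same entity
# can carry different names over time, so aliases have validity ranges.
from datetime import date

ALIASES = [
    # (surface form, entity id, valid from, valid until)
    ("Andersen Consulting", "ACCENTURE", date(1989, 1, 1), date(2001, 1, 1)),
    ("Accenture",           "ACCENTURE", date(2001, 1, 1), date.max),
]

def resolve(name, as_of):
    for surface, entity, start, end in ALIASES:
        if surface == name and start <= as_of < end:
            return entity
    return None  # name unknown, or not in use at that date

print(resolve("Accenture", date(1995, 6, 1)))  # None: name not yet in use
print(resolve("Accenture", date(2005, 6, 1)))  # ACCENTURE
```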

I kind of don’t recognise Mike’s comments on video, as object recognition is available as an option… But I think we get more value out of text than most people, and we get real value from audio: transcription and beyond… entity recognition, dialogue structure, event extraction… a fairly long NLP pipeline there…

A3 – AF) The kinds of things you want to identify are very similar to the desires we have in the humanities, and they have additional benefits for journalists too. Is text search enough? Not really. It’s an interesting way in… But text isn’t the best way to understand historical images in a range of books, and it isn’t that useful for the images in the UK Web Archive either. Much of what may be of interest is not the text, and is perhaps better reduced to a series of shapes etc.

Q4) Crowdsourcing has been mentioned already, and I was wondering about that experience – what worked and what didn’t – and, thinking back to Mike’s presentation, what might work better?

A4 – AF) We found that smaller batches worked better… People love to see progress and like to have a sense of accomplishment. We found rewards were nice – we offered lunch with the head of maps at the British Library and that was important. Also, mix it up – not always the same super hard problems all the time.

A4 – MS) I was going to give the BL example of your games machine… A mix of crowdsourcing and gamification.

A4 – AF) It’s very experimental but, as mentioned in the earlier panel session about the Tinder-like app, we’ve worked with Adam Crymble to build an arcade game to do image classification, and we are interested to see if people will use their time differently with this device. Will they classify images and help us build up our training sets? The idea is engagement away from desktops or laptops…

A4 – RC) We have tried crowdsourcing for corrections. Our services tend to be subscription and pay-as-you-go, but people still see value in contributing. And you can incentivise that. You also see examples across the world of central or government websites using crowdsourcing for transcription.

A4 – GK) You could argue that we were innovators in crowdsourcing at Bloomberg, through blogs etc., and through tagging of entities. What we have learned from crowdsourcing is that it isn’t good for everything. It’s hard when specialist knowledge is needed, or when specific languages are needed – it’s hard to get people to tag in Japanese. We aren’t opposed to paying for contributions, but you have to set it up effectively. We found you have to define tasks very specifically, for instance.

Q5) Talking about transposing to text implies that that is really possible. If we can’t do image descriptions effectively with text, then what else should we be doing? I was wondering what the panel thought in terms of modalities of data…

A5 – MS) Whatever we do to mark up content is only as good as our current tools, understanding and modalities – and we’d want to go back and mark it up differently later. In Google you can search for an image with an image… That’s changed over time… Now it uses text on the page to gather context and presents that, as well as the image, back to you… If you can store a fingerprint to compare to others… we are doing visual searches, searches that are not text-based. Some of these things already exist and they will get better and better. And the ability to scale and respond will be where the money is.

Q6) The discussion is quite interesting, as at the moment it’s about value you define… But you could see the BBC as a form of commons… It could be useful for local value, for decision making, etc., where you are not in a position to declare the value in advance… And there are lots of types of value out there, particularly in a global market.

A6 – MS) The BBC has various rules and regulations about publishing media, one of which is that humans always have to check content, and that is a real restriction on scale, particularly as we are looking to reduce staff. We ran an initiative called MCB with the University of Edinburgh that opened up some of these ideas. Ideally we would have every single minute of broadcast TV and radio in the public domain… But we don’t have the rights to everything… In many cases we acquired content before digital, which means you need to renegotiate content licences etc. before digitising.

A6 – AF) Licences can be an issue; privacy and data protection can be an issue. But we also have the challenge of how we meet user needs, and of actually listening to those needs. Sometimes we have to feel comfortable providing a lower-level service, one that may require higher skills (e.g. coding) to use… That can be something wonderful – not everything needs to be a super polished service – but it does have to be useful and valuable. And what is useful, and what is possible, will keep changing.

A6 – GK) For us it’s an interesting question. Our users won’t say what they want, so you have to reverse engineer and then do rapid product development… So we do what you (Michael) suggest – building rapid prototypes to try ideas out. But this isn’t just a volatile time, it’s a volatile decade, or more!

Q7) Can you tell us anything about how you manage the funnel for production, and how context is baked in during the content creation process?

A7 – GK) There is a whole toolset for creating and encoding metadata, and doing so in a way that’s meaningful to people beyond the organisation… But I could talk about that for an hour, so better to pick that up later I think.

Q8 – SK) How multilingual do you actually need to be in your work?

A8 – GK) We currently ingest content in 34 languages, but 10 languages cover the majority – though things change quickly. It used to be that 90% of content ingested was in English; now it’s 70-80%. That’s a shift… We have not yet seen a case where lots of data suddenly appears in a language where there was previously none. Instead we see particularly well-resourced languages. Japanese is a large, well-resourced language with many resources in place, but it is very tricky from a computational perspective. And that can mean you still need humans.

A8 – MS) I probably have a different perspective on languages… We have BBC Research working in Africa with communities just going online for the first time. There are hundreds of languages in Africa, but none of them will be a huge language… There are a few approaches… You can either translate directly, or you can convert into English and translate from there. Some use speech-to-text, with a Stephen Hawking type voice to provide continuity.

A8 – AF) Our collections cover all languages at all times… an increasingly difficult challenge.

Comment – Susanne, BBC) I wanted to comment on speed of access to different languages. All it takes is a catastrophe like an Ebola outbreak, or a disaster in Ukraine or in Turkey… and you suddenly have the use case for ASR and machine translation. And you see the audience expectations there.

A8 – MS) And you could put £1M into many languages and make little impact… But if you put that into one key language, e.g. Pashto, you might have more impact… We need to consider that in our funding and prioritisation.

A8 – GK) Yes, one disaster or event can make a big difference… If you provide the tools for people to access information and to add typing for their own language… In the case of, say, Ebola, you needed doctors speaking the language of the patient… and I’m not sure there is a technological solution to that. Similarly a case on the Amazon… Technology cannot always help here.

Q9) Do you have concerns that translations might be read in different contexts and be misinterpreted – the potential to get things massively wrong in another language? Do you have systems (human or machine) to deal with that?

A9 – AF) I won’t quite answer your question, but a related thing… In some sense that’s the problem of data… Data becomes authoritative, and unless we make it accessible, cite it, and explain how it came about, that authority goes unquestioned. So we have large data collections being made available – BBC, BL etc. – and they can be examined in a huge set of new ways… They require different habits, tools and approaches than many of us are used to using – different tools than, for example, academics in the humanities typically have. And we need to emphasise the importance of proper citing, sharing, describing etc.

A9 – MS) I’d absolutely agree about transparency. Another of Susanne’s projects, Babel, gives a rough translation that can then be amended. But an understanding of the context is so important.

A9 – GK) We had a query last week, in German, for something from Der Spiegel… It got translated to The Mirror… But there is a news source called The Mirror… So the translation makes sense… except you need outside data to be able to make sense of this stuff… It’s really an open question where that data should live and how you would do that.
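[One common workaround – my sketch, not Bloomberg’s pipeline – is to mask known named entities before translation and restore them afterwards; the source list and placeholder scheme here are assumptions.]

```python
# Mask known entity names before machine translation so "Der Spiegel"
# doesn't come back as "The Mirror", then restore them afterwards.
KNOWN_SOURCES = {"Der Spiegel", "Le Monde", "El País"}

def protect_entities(text):
    """Replace known entity names with placeholder tokens."""
    placeholders = {}
    masked = text
    for i, name in enumerate(n for n in KNOWN_SOURCES if n in text):
        token = f"__ENT{i}__"
        placeholders[token] = name
        masked = masked.replace(name, token)
    return masked, placeholders

def restore_entities(text, placeholders):
    for token, name in placeholders.items():
        text = text.replace(token, name)
    return text

# "A report by Der Spiegel about the election"
masked, ph = protect_entities("Ein Bericht von Der Spiegel über die Wahl")
# ...the masked text would go through the MT system here...
print(restore_entities(masked, ph))
```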

Q10 – SK) So, a final question: What should ATI do in this space?

A10 – RC) For us, we’d like to see what can be done at the SME level, and some products taken to market…

A10 – GK) I think there are quite a lot of things that the ATI can do… There is a lot of stuff the industry won’t beat you to – the world is changing too rapidly for that. I think the University and the ATI should be better connected to industry – and I’ll talk about that tomorrow.

A10 – AF) As a national institution, the British Library has a lot of data and content, but the question is how we can make sense of that large collection. The second issue is skills – there is a lot to learn about data and about working with large data collections. And thirdly there is convening: bringing together data and content, technologists, and researchers with questions to ask of the data. I think ATI can be really effective in bringing those people together.

A10 – MS) We were at an ideas hack day at the British Library a few weeks back, and that was a great opportunity to bring together the people who create data, who research, etc. I think ATI should be the holder of best practice, connecting the holders of content, academia, etc. to work together to add value. For me, trying to add value where it counts really makes a difference. For instance, we are doing some Welsh speech-to-text work which I’m keen to share with others in some way…

SK: Is there anything else that anyone here wants to add to the ATI to-do list?

Comment: I want to see us get so much better at multilingual support – the Babel fish for all spoken languages, ideally!


Closing Remarks – Steve Renals, Informatics, University of Edinburgh

I think today has been something of a kick-off for building relationships, and we’ve seen some great opportunities today. There will be more opportunity to continue this over drinks as we finish for the day.

And with that we are basically done, save for a request to hand in our badges in exchange for a mug – emblazoned with an Eduardo Paolozzi design inspired by a biography of Alan Turing – in honour of Turing’s unusual attachment to his own mug (which used to be chained to the radiator!).
