Jun 15 2017

I am again at the IIPC WAC / RESAW Conference 2017 and today I am in the very busy technical strand at the British Library. See my Day One post for more on the event and on the HiberActive project, which is why I’m attending this very interesting strand.

These notes are live so, as usual, comments, additions, corrections, etc. are very much welcomed.

Tools for web archives analysis & record extraction (chair Nicholas Taylor)

Digging documents out of the archived web – Andrew Jackson

This is the technical counterpoint to the presentation I gave yesterday… So I talked yesterday about the physical workflow of catalogue items… We found that the Digital ePrints team had started processing eprints the same way…

  • staff looked in an outlook calendar for reminders
  • looked for new updates since last check
  • download each to local folder and open
  • check catalogue to avoid re-submitting
  • upload to internal submission portal
  • add essential metadata
  • submit for ingest
  • clean up local files
  • update stats sheet
  • Then ingest is usually automated (but can require intervention)
  • Updates catalogue once complete
  • New catalogue records processed or enhanced as necessary.

It was very manual, and very inefficient… So we have created a harvester:

  • Setup: specify “watched targets” then…
  • Harvest (harvester crawls targets as usual) –> Ingested… but also…
  • Document extraction:
    • spot documents in the crawl
    • find landing page
    • extract machine-readable metadata
    • submit to W3ACT (curation tool) for review
  • Acquisition:
    • check document harvester for new publications
    • edit essential metadata
    • submit to catalogue
  • Cataloguing
    • cataloguing records processed as necessary

This is better but there are challenges. Firstly, what is a “publication”? With the eprints team there was a one-to-one print and digital relationship. But now, no more one-to-one. For example, gov.uk publications… An original report will have an ISBN… But that landing page is a representation of the publication, that’s where the assets are… When stuff is catalogued, what can frustrate technical folk… You take date and text from the page – honouring what is there rather than normalising it… We can dishonour intent by capturing the pages… It is challenging…

MARC is initially alarming… For a developer used to current data formats, it’s quite weird to get used to. But really it is just encoding… There is how we say we use MARC, how we do use MARC, and where we want to be now…

One of the intentions of the metadata extraction work was to provide an initial guess at the catalogue data – hoping to save cataloguers and curators time. But you probably won’t be surprised that the authors’ names etc. in the document metadata are rarely correct. We use the worst extractor first, and layer up so we have the best shot. What works best is extracting from the HTML. Gov.uk is a big and consistent publishing space so it’s worth us working on extracting that.

What works even better is the gov.uk API data – it’s in JSON, it’s easy to parse, it’s worth coding as it is a bigger publisher for us.
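
To make the JSON point concrete, here is a minimal sketch of pulling publication metadata from the GOV.UK Content API – the path and field names below are my assumptions for illustration, not a description of the BL’s actual pipeline:

```python
import json
import urllib.request

# GOV.UK exposes page metadata as JSON under /api/content/<path>; the path
# and field names used here are illustrative only.
url = "https://www.gov.uk/api/content/government/publications"

with urllib.request.urlopen(url) as resp:
    doc = json.loads(resp.read().decode("utf-8"))

# Pick out the kind of fields a cataloguer might want as a starting guess.
print(doc.get("title"))
print(doc.get("document_type"))
print(doc.get("first_published_at"))
```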

But now we have to resolve references… Multiple use cases for “records about this record”:

  • publisher metadata
  • third party data sources (e.g. Wikipedia)
  • Our own annotations and catalogues
  • Revisit records

We can’t ignore the revisit records… Have to do a great big join at some point… To get best possible quality data for every single thing….

And this is where the layers of transformation come in… Lots of opportunities to try again and build up… But… When I retry document extraction I can accidentally run up another chain each time… If we do our Solr searches correctly it should be easy, so we will be correcting this…

We do need to do more future experimentation… Multiple workflows bring synchronisation problems. We need to ensure documents are accessible when discoverable. We need to be able to re-run automated extraction.

We want to iteratively improve automated metadata extraction:

  • improve HTML data extraction rules, e.g. Zotero translators (and I think LOCKSS are working on this).
  • Bring together different sources
  • Smarter extractors – Stanford NER, GROBID (built for sophisticated extraction from ejournals)

And we still have that tension over what a publication is… A tension between established practice and publisher output. We need to trial different approaches with catalogues and users… Close that whole loop.

Q&A

Q1) Is the PDF you extract going into another repository… You probably have a different preservation goal for those PDFs and the archive…

A1) Currently the same copy for archive and access. Format migration probably will be an issue in the future.

Q2) This is quite similar to issues we’ve faced in LOCKSS… I’ve written a paper with Herbert Van de Sompel and Michael Nelson about this thing of describing a document…

A2) That’s great. I’ve been working with the Government Digital Service and they are keen to do this consistently….

Q2) Geoffrey Bilder also working on this…

A2) And that’s the ideal… To improve the standards more broadly…

Q3) Are these all PDF files?

A3) At the moment, yes. We deliberately kept scope tight… We don’t get a lot of ePub or open formats… We’ll need to… Now publishers are moving to HTML – which is good for the archive – but that’s more complex in other ways…

Q4) What does the user see at the end of this… Is it a PDF?

A4) This work ends up in our search service, and that metadata helps them find what they are looking for…

Q4) Do they know its from the website, or don’t they care?

A4) Officially, the way the library thinks about monographs and serials, would be that the user doesn’t care… But I’d like to speak to more users… The library does a lot of downstream processing here too..

Q4) For me as an archivist all that data on where the document is from, what issues there were in accessing it, etc. would be extremely useful…

Q5) You spoke yesterday about engaging with machine learning… Can you say more?

A5) This is where I’d like to do more user work. The library is keen on subject headings – that’s a big high level challenge so that’s quite amenable to machine learning. We have a massive golden data set… There’s at least a masters thesis in there, right! And if we built something, then ran it over the 3 million-ish items with little metadata, that could be incredibly useful. In my opinion this is what big organisations will need to do more and more of… making best use of human time to tailor and tune machine learning to do much of the work…

Comment) That thing of everything ending up as a PDF is on the way out by the way… You should look at Distill.pub – a new journal from Google and Y Combinator – and that’s the future of these sorts of formats, it’s JavaScript and GitHub. Can you collect it? Yes, you can. You can visit the page, switch off the network, and it still works… And it’s there and will update…

A6) As things are more dynamic the re-collecting issue gets more and more important. That’s hard for the organisation to adjust to.

Nick Ruest & Ian Milligan: Learning to WALK (Web Archives for Longitudinal Knowledge): building a national web archiving collaborative platform

Ian: Before I start, thank you to my wider colleagues and funders as this is a collaborative project.

So, we have fantastic web archival collections in Canada… They collect political parties, activist groups, major events, etc. But, whilst these are amazing collections, they aren’t accessed or used much. I think this is mainly down to two issues: people don’t know they are there; and the access mechanisms don’t fit well with their practices. Maybe when the Archive-It API is live that will fix it all… Right now though it’s hard to find the right thing, and the Canadian archive is quite siloed. There are about 25 organisations collecting, most using the Archive-It service. But, if you are a researcher… to use web archives you really have to be interested and engaged, you need to be an expert.

So, building this portal is about making this easier to use… We want web archives to be used on page 150 in some random book. And that’s what the WALK project is trying to do. Our goal is to break down the silos, take down walls between collections, between institutions. We are starting out slow… We signed Memoranda of Understanding with Toronto, Alberta, Victoria, Winnipeg, Dalhousie, Simon Fraser University – that represents about half of the archive in Canada.

We work on workflow… We run workshops… We separated the collections so that postdocs can look at them…

We are using Warcbase (warcbase.org) and command line tools; we transferred data from the Internet Archive, generated checksums; we generate scholarly derivatives – plain text, hypertext graphs, etc. In the front end you enter basic information, describe the collection, and make sure that the user can engage directly themselves… And those visualisations are really useful… Looking at a visualisation of the Canadian political parties and political interest group web crawls, which tracks changes, although that may include crawler issues.

Then, with all that generated, we create landing pages, including tagging, data information, visualizations, etc.

Nick: So, on a technical level… I’ve spent the last ten years in open source digital repository communities… This community is small and tight-knit, and I like how we build and share and develop on each other’s work. Last year we presented webarchives.ca. We’ve indexed 10 TB of WARCs since then, representing 200+ M Solr docs. We have grown from one collection and we have needed additional facets: institution; collection name; collection ID, etc.
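
For a sense of what those extra facets look like in practice, here is a small sketch of a faceted Solr query – the host, core name and field names are made up for illustration, not taken from webarchives.ca:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Solr core and field names, purely illustrative.
base = "http://localhost:8983/solr/webarchive/select"
params = {
    "q": "pipeline",       # full-text query
    "rows": 0,             # only the facet counts are wanted here
    "facet": "true",
    "facet.field": ["institution", "collection_name", "collection_id"],
    "wt": "json",
}
url = base + "?" + urllib.parse.urlencode(params, doseq=True)

with urllib.request.urlopen(url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Solr returns facet values as a flat [value, count, value, count, ...] list.
print(data["facet_counts"]["facet_fields"]["institution"])
```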

Then we have also dealt with scaling issues… from a 30–40 GB to a 1 TB sized index. You probably think that’s kinda cute… But we do have more scaling to do… So we are learning from others in the community about how to manage this… We have Solr running on OpenStack… Right now it isn’t at production scale, but it is getting there. We are looking at SolrCloud and potentially using a Shard2 per collection.

Last year we had a Solr index using the Shine front end… It’s great but… it doesn’t have an active open source community… We love the UK Web Archive but… Meanwhile there is Blacklight, which is in wide use in libraries. There is a bigger community, better APIs, bug fixes, etc… So we have set up a prototype called Warclight. It does almost all that Shine does, except the tree structure and the advanced searching…

Ian spoke about derivative datasets… For each collection, via Blacklight or ScholarsPortal we want domain/URL Counts; Full text; graphs. Rather than them having to do the work, they can just engage with particular datasets or collections.

So, that goal Ian talked about: one central hub for archived data and derivatives…

Q&A

Q1) Do you plan to make graphs interactive, by using Kibana rather than Gephi?

A1 – Ian) We tried some stuff out… One colleague tried R in the browser… That was great but didn’t look great in the browser. But it would be great if the casual user could look at drag and drop R type visualisations. We haven’t quite found the best option for interactive network diagrams in the browser…

A1 – Nick) Generally the data is so big it will bring down the browser. I’ve started looking at Kibana for stuff so in due course we may bring that in…

Q2) Interesting as we are doing similar things at the BnF. We did use Shine, looked at Blacklight, but built our own thing…. But we are looking at what we can do… We are interested in that web archive discovery collections approaches, useful in other contexts too…

A2 – Nick) I kinda did this the ugly way… There is a more elegant way to do it but haven’t done that yet..

Q2) We tried to give people WARC and WARC files… Our actual users didn’t want that, they want full text…

A2 – Ian) My students are quite biased… Right now if you search it will flake out… But by fall it should be available, I suspect that full text will be of most interest… Sociologists etc. think that network diagram view will be interesting but it’s hard to know what will happen when you give them that. People are quickly put off by raw data without visualisation though so we think it will be useful…

Q3) Do you think in a few years’ time…

A3) Right now that doesn’t scale… We want this more cloud-based – that’s our next 3 years and next wave of funded work… We do have capacity to write new scripts right now as needed, but when we scale that will be harder…

Q4) What are some of the organisational, admin and social challenges of building this?

A4 – Nick) Going out and connecting with the archives is a big part of this… Having time to do this can be challenging…. “is an institution going to devote a person to this?”

A4 – Ian) This is about making this more accessible… People are more used to Blacklight than Shine. People respond poorly to WARC. But they can deal with PDFs and CSVs, those are familiar formats…

A4 – Nick) And when I get back I’m going to be doing some work and sharing to enable an actual community to work on this..

Gregory Wiedeman: Automating access to web archives with APIs and ArchivesSpace

A little bit of context here… At the University at Albany, SUNY, we are a public university with state records laws that require us to archive. This is consistent with traditional collecting. But we have no dedicated web archives staff – so no capacity for lots of manual work.

One thing I wanted to note is that web archives are records. Some have a paper equivalent, or did for many years (e.g. the Undergraduate Bulletin). We also have things like Word documents. And then we have things like University sports websites, some of which we do need to keep…

The seed isn’t a good place to manage these as records. But archives theory and practices adapt well to web archives – they are designed to scale, they document and maintain context, with relationship to other content, and a strong emphasis on being a history of records.

So, we are using DACS (Describing Archives: A Content Standard) to describe archives – why not use that for web archives? It focuses on intellectual content, ignorant of formats, and is designed for pragmatic access to archives. We also use ArchivesSpace – a modern tool for aggregated records that allows curators to add metadata about a collection. And it is interleaved with our physical archives.

So, for any record in our collection… You can specify a subject… a Python script goes to look at our CDX, looks at the numbers, schedules processes, and then as we crawl a collection records the extents and data collected… And then this shows in our catalogue… So we have our paper records, our digital captures… Users can then find an item, and only then do they need to think about format and context. And there is an awesome article by David Graves(?) which talks about how that aggregation encourages new discovery…
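
As a rough illustration of the kind of CDX bookkeeping such a script might do, here is a minimal sketch – the column order assumed below is the common urlkey/timestamp/original layout, which may differ from the actual files:

```python
from collections import Counter

def summarise_cdx(path, url_prefix):
    """Count captures per year for URLs under a given prefix.

    Assumes a space-delimited CDX file whose first three columns are urlkey,
    14-digit timestamp and original URL (a common layout, but check the
    header line of your own CDX files).
    """
    per_year = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith(" CDX"):   # header line in classic CDX files
                continue
            fields = line.split()
            if len(fields) < 3:
                continue
            timestamp, original = fields[1], fields[2]
            if original.startswith(url_prefix):
                per_year[timestamp[:4]] += 1
    return per_year

# e.g. summarise_cdx("collection.cdx", "http://www.albany.edu/")
```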

Users need to understand where web archives come from. They need provenance to frame their research question – it adds weight to their research. So we need to capture what was attempted to be collected – collecting policies included. We have just started to do this with a statement on our website. We need a more standardised content source. This sort of information should be easy to use and comprehend, but it is hard to find the right format to do that.

We also need to capture what was collected. We are using the Archive-It Partner Data API, part of the Archive-It 5.0 system. That API captures:

  • type of crawl
  • unique ID
  • crawl result
  • crawl start, end time
  • recurrence
  • exact date, time, etc…

This looks like a big JSON file. Knowing what has been captured – and not captured – is really important to understand context. What can we do with this data? Well we can see what’s in our public access system, we can add metadata, we can present some start times, non-finish issues etc. on product pages. BUT… it doesn’t address issues at scale.
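
To make that concrete, here is a hedged sketch of walking such a JSON crawl report – the key names mirror the field list above, but the exact structure of the Partner Data API response is an assumption here:

```python
import json

# "crawls.json" stands in for a saved Partner Data API response; the key
# names below follow the fields listed above but are assumptions.
with open("crawls.json", encoding="utf-8") as fh:
    data = json.load(fh)

crawls = data if isinstance(data, list) else data.get("results", [])

for crawl in crawls:
    result = crawl.get("crawl_result")
    # Flag crawls that never finished cleanly: exactly the kind of context a
    # researcher needs in order to understand gaps in the collection.
    if result and result.lower() != "finished":
        print(crawl.get("id"), crawl.get("type"),
              crawl.get("start_date"), result)
```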

So, we are now working on a new open digital repository using the Hydra system – though not called that anymore! Possibly we will expose data in the API. We need standardised data structure that is independent of tools. And we also have a researcher education challenge – the archival description needs to be easy to use, re-share and understand.

Find our work – sample scripts, command line query tools – on Github:

http://github.com/UAlbanyArchives/describingWebArchives

Q&A

Q1) Right now people describe collection intent, crawl targets… How could you standardise that?

A1) I don’t know… Need an intellectual definition of what a crawl is… And what the depth of a crawl is… They can produce very different results and WARC files… We need to articulate this in a way that is clear for others to understand…

Q1) Anything equivalent in the paper world?

A1) It is DACS, but in the paper world we don’t get that granular… This is really specific data we weren’t really able to get before…

Q2) My impression is that ArchivesSpace isn’t built with discovery of archives in mind… What would help with that…

A2) I would actually put less emphasis on web archives… Long term you shouldn’t have all these things captured. We just need a good API access point really… I would rather it be modular I guess…

Q3) Really interesting… the definition of Archive-It, what’s in the crawl… And interesting to think about conveying what is in the crawl to researchers…

A3) From what I understand the Archive-It people are still working on this… With documentation to come. But we need granular way to do that… Researchers don’t care too much about the structure…. They don’t need all those counts but you need to convey some key issues, what the intellectual content is…

Comment) Looking ahead to the WASAPI presentation… Some steps towards vocabulary there might help you with this…

Comment) I also added that sort of issue for today’s panels – high level information on crawl or collection scope. Researchers want to know when crawlers don’t collect things, when to stop – usually to do with freak outs about what isn’t retained… But that idea of understanding absence really matters to researchers… It is really necessary to get some… There is a crapton of data in the partners API – most isn’t super interesting to researchers so some community effort to find 6 or 12 data points that can explain that crawl process/gaps etc…

A4) That issue of understanding users is really important, but also hard as it is difficult to understand who our users are…

Harvesting tools & strategies (Chair: Ian Milligan)

Jefferson Bailey: Who, what, when, where, why, WARC: new tools at the Internet Archive

Firstly, apologies for any repetition between yesterday and today… I will be talking about all sorts of updates…

So, WayBack Search… You can now search the WayBackMachine… Including keyword, host/domain search, etc. The index is built on inbound anchor text links to a homepage. It is pretty cool and it’s one way to access this content which is not URL based. We also wanted to look at domain and host routes into this… So, if you look at the page for, say, parliament.uk you can now see statistics and visualisations. And there is an API so you can make your own visualisations – for hosts or for domains.

We have done stat counts for specific domains or crawl jobs… The API is all in JSON so you can just parse this for, for example, how much of what is archived for a domain is in the form of PDFs.
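
The public Wayback CDX server API gives a flavour of this kind of query – the sketch below counts archived PDFs for a domain, but treat it as an illustration rather than the exact stats API described in the talk:

```python
import json
import urllib.parse
import urllib.request

# All PDF captures under a domain, as JSON, capped at 1000 rows.
params = {
    "url": "parliament.uk",
    "matchType": "domain",
    "filter": "mimetype:application/pdf",
    "output": "json",
    "limit": "1000",
}
url = "https://web.archive.org/cdx/search/cdx?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    rows = json.loads(resp.read().decode("utf-8") or "[]")

# The first row is a header row; every other row is one capture.
print(max(len(rows) - 1, 0), "PDF captures found (capped at 1000)")
```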

We also now have search by format using the same idea, the anchor text, the file and URL path, and you can search for media assets. We don’t have exciting front end displays yet… But I can search for e.g. puppy, mime type: video, 2014… And get lots of awesome puppy videos [the demo is the Puppy Bowl 2014!]. This media search is available for parts of the WayBackMachine for some media types… And you can again present this in the format and display you’d like.

For search and profiling we have a new 14-column CDX including new language, simhash and sha256 fields. Language will help users find material in their local/native languages. The SIMHASH is pretty exciting… that allows you to see how much a page has changed. We have been using it on Archive-It partners… And it is pretty good. For instance seeing a government blog change month to month shows the (dis)similarity.
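
To illustrate what “how much a page has changed” means with a simhash, here is a tiny sketch comparing two captures by Hamming distance – the 64-bit values are invented for the example:

```python
def simhash_distance(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit simhash values."""
    return bin(a ^ b).count("1")

# Invented values for two captures of the same URL.
capture_march = 0x8F3A61C2D4E59B07
capture_april = 0x8F3A61C2D4E19B47

d = simhash_distance(capture_march, capture_april)
print(f"Hamming distance: {d} bits")
# A small distance suggests the page barely changed; a large one suggests
# substantial change and may be worth a closer look.
```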

For those that haven’t seen the capture tool – Brozzler is in production in Archive-It with 3 dozen organisations using it. This has also led to warcprox developments too. It was intended for AV and social media stuff. We have a Chromium cluster… It won’t do domain harvesting, but it’s good for social media.

In terms of crawl quality assurance we are working with the Internet Memory Foundation to create quality tools. These build on internal crawl priorities work at IA – crawler beans, comparison testing. And this is about quality at scale. And you can find reports on how we also did associated work on the WayBackMachine’s crawl quality. We are also looking at tools to monitor crawls for partners, trying to find large scale crawling quality as it happens… There aren’t great analytics… But there are domain-scale monitoring, domain-scale patch crawling, and Slack integrations.

For domain scale work, for patch crawling we use WAT analysis for embeds and most linked. We rank by inbound links and add to the crawl. ArchiveSpark is a framework for cluster-based data extraction and derivation (WA+).

Although this is a technical presentation we are also doing an IMLS funded project to train public librarians in web archiving to preserve online local history and community memory, working with partners in various communities.

Other collaborations and research include our End of Term web archive 2016/17, when the administration changes… No one is the official custodian for .gov. And this year the widespread deletion of data has given this work greater profile than usual. This time the work was with IA, LOC, UNT, GWU, and others. 250+ TB of .gov/.mil as well as White House and Obama social media content.

There had already been discussion of the Partner Data API. We are currently re-building this so come talk to me if you are interested in it. We are working with partners to make sure this is useful, makes sense, and is made more relevant.

We take a lot of WARC files from people to preserve… So we are looking to see how we can get partners to do this with and for us. We are developing a pipeline for automated WARC ingest for web services.

There will be more on WASAPI later, but this is part of work to ensure web archives are more accessible… And that uses API calls to connect up repositories.

We have also built a WAT API that allows you to query most of the metadata for a WARC file. You can feed it URLs, and get back what you want – except the page type.

We have new portals and searches now and coming. This is about putting new search layers on TLD content in the WayBackMachine… So you can pick media types, and just from one domain, and explore them all…

And with a statement on what archives should do – involving a gif of a centaur entering a rainbow room – that’s all… 

Q&A

Q1) What are implications of new capabilities for headless browsing for Chrome for Brozzler…

A1 – audience) It changes how fast you can do things, not really what you can do…

Q2) What about http post for WASAPI

A2) Yes, it will be in the Archive-It web application… We’ll change a flag and then you can go and do whatever… And there is reporting on the backend. It doesn’t usually affect crawl budgets, it should be pretty automated… There is a UI… Right now we do a lot manually, the idea is to do it less manually…

Q3) What do you do with pages that don’t specify encoding… ?

A3) It doesn’t go into URL tokenisation… We would wipe character encoding in anchor text – it gets cleaned up before Elasticsearch…

Q4) The SIMHASH is before or after the capture? And can it be used for deduplication

A4) After capture before CDX writing – it is part of that process. Yes, it could be used for deduplication. Although we do already do URL deduplication… But we could compare to previous SIMHASH to work out if another copy is needed… We really were thinking about visualising change…

Q5) I’m really excited about WATS… What scale will it work on…

A5) The crawl is on 100 TB – we mostly use the existing WARC and JSON pipeline… It performs well on something large. But if there are a lot of URLs, it could be a lot to parse.

Q6) With quality analysis and improvement at scale, can you tell me more about this?

A6) We’ve given the IMF access to our own crawls… But so far we have been comparing our own crawls to our own crawls… Comparing to Archive-It is more interesting… And looking at domain level… We need to share some similar size crawls – BL and IA – and figure out how results look and differ. It won’t be content based at that stage, it will be hotpads and URLs and things.

Michele C. Weigle, Michael L. Nelson, Mat Kelly & John Berlin: Archive what I see now – personal web archiving with WARCs

Mat: I will be describing tools here for web users. We want to enable individuals to create personal web archives in a self-contained way, without external services. Standard web archiving tools are difficult for non-IT experts. “Save page as” is not suitable for web archiving. Why do this? It’s for people who don’t want to touch the command line, but also to ensure content is preserved that wouldn’t otherwise be. More archives are better.

It is also about creation and access, as both elements are important.

So, our goals involve advancing development of:

  • WARCreate – create WARC from what you see in your browser.
  • Web Archiving Integration Layer (WAIL)
  • Mink

WARCreate is… a Chrome browser extension to save WARC files from your browser; no credentials pass through 3rd parties. It heavily leverages the Chrome webRequest API. But it was built in 2012 and APIs and libraries have evolved, so we had to work on that. We also wanted three new modes for browser-based preservation: record mode – retain a buffer as you browse; countdown mode – preserve a reloading page on an interval; event mode – preserve the page when it is automatically reloaded.

So you simply click on the WARCreate button in the browser to generate WARC files – for non-technical people.
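
For a feel of what “generating a WARC file” involves under the hood, here is a small sketch using the warcio Python library – this is not WARCreate’s own (JavaScript) code, just an illustration of writing a single response record:

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

# Illustrative only: write one response record for a page we "saw".
payload = b"<html><body>Example page as rendered</body></html>"

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)
    http_headers = StatusAndHeaders(
        "200 OK",
        [("Content-Type", "text/html; charset=utf-8")],
        protocol="HTTP/1.1",
    )
    record = writer.create_warc_record(
        "http://example.com/",
        "response",
        payload=BytesIO(payload),
        http_headers=http_headers,
    )
    writer.write_record(record)
```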

Web Archiving Integration Layer (WAIL) is a stand-alone desktop application; it offers collection-based web archiving, and includes Heritrix for crawling, OpenWayback for replay, and Python scripts compiled to OS-native binaries (.app, .exe). One of the recent advancements was a new user interface. We ported from Python to Electron – using web technologies to create native apps. And that means you can use native languages to help you to preserve. We also moved from a single archive to collection-based archiving. We also ported from OpenWayback to pywb. And we also started doing native Twitter integration – over time and hashtags…

So, the original app was a tool to enter a URI and then get a notification. The new version is a little more complicated but provides that new collection-based interface. Right now both of these are out there… Eventually we’d like to merge functionality here. So, an example here, looking at the UK election as a collection… You can enter information, then crawl to within defined boundaries… You can kill processes, or restart an old one… And this process integrates with Heritrix to give status of a task here… And if you want to Archive Twitter you can enter a hashtag and interval, you can also do some additional filtering with keywords, etc. And then once running you’ll get notifications.

Mink… is a Google Chrome browser extension. It indicates the archival capture count as you browse, and quickly submits a URI to multiple archives from the UI. The name is from Mink(owski) space. Our recent work includes enhancements to the interface to add the number of archived pages to an icon at the bottom of the page. And it allows users to set preferences on how to view a large set of mementos. And communication with user-specified or local archives…

The old Mink interface could be affected by page CSS as it was in the DOM. So we have moved to a shadow DOM, making it more reliable and easy to use. And then you have more consistent, intuitive Miller columns for many captures. It’s an integration of the live and archived web, whilst you are viewing the live web. And you can see year, month, day, etc. And it is refined to what you want to look at. And you have an icon in Mink to make a request to save the page now – and notification of status.

So, in terms of tool integration…. We want to ensure integration between Mink and WAIL so that Mink points to local archives. In the future we want to decouple Mink from external Memento aggregator – client-side customisable collection of archives instead.

See: http://bit.ly/iipcWAC2017 for tools and source code.

Q&A

Q1) Do you see any qualitative difference in capture between WARCreate and WARC recorder?

A1) We capture the representation right at the moment you saw it.. Not the full experience for others, but for you in a moment of time. And that’s our goal – what you last saw.

Q2) Who are your users, and do you have a sense of what they want?

A2) We have a lot of digital humanities scholars wanting to preserve Twitter and Facebook – the stream as it is now, exactly as they see it. So that’s a major use case for us.

Q3) You said it is watching as you browse… What happens if you don’t select a WARC

A3) If you have hit record you could build up content as pages reload and are in that record mode… It will impact performance but you’ll have a better capture…

Q3) Just a suggestion but I often have 100 tabs open but only want to capture something once a week so I might want to kick it off only when I want to save it…

Q4) That real time capture/playback – are there cool communities you can see using this…

A4) Yes, I think with CNN coverage of a breaking storm allows you to see how that story evolves and changes…

Q5) Have you considered a mobile version for social media/web pages on my phone?

A5) Not currently supported… Chrome doesn’t support that… There is an app out there that lets you submit to archives, but not to create WARC… But there is a movement to making those types of things…

Q6) Personal archiving is interesting… But jailed in my laptop… great for personal content… But then can I share my WARC files with the wider community?

A6) That’s a good idea… And more captures is better… So there should be a way to aggregate these together… I am currently working on that, but you would need to be able to specify what is shared and what is not.

Q6) One challenge there is about organisations and what they will be comfortable with sharing/not sharing.

Lozana Rossenova and Ilya Kreymer, Rhizome: Containerised browsers and archive augmentation

Lozana: As you probably know, Webrecorder is a high fidelity interactive recording of any web site you browse – and how you engage. And we have recently released an app in Electron format.

Webrecorder is a worm’s eye view of archiving, tracking how users actually move around the web… For instance, for Instagram and Twitter posts around #lovewins you can see the quality is high. Webrecorder uses symmetrical archiving – in the live browser and in a remote browser… And you can capture then replay…

In terms of how we organise webrecorder: we have collections and sessions.

The thing I want to talk about today is remote browsers, and my work with Rhizome on internet art. A lot of these works actually require old browser plugins and tools… So Webrecorder enables capture and replay even where the technology is no longer available.

To clarify: the programme says “containerised” but we now refer to this as “remote browsers” – still using Docker containers to run these various older browsers.

When you go to record a site you select the browser, and the site, and it begins the recording… The Java applet runs and shows you a visualisation of how it is being captured. You can do this with Flash as well… If we open a multimedia work in a normal (Chrome) browser, it doesn’t work. Restoration is easier with just Flash; you need other things to capture Flash with other dependencies and interactions.

Remote browsers are really important for Rhizome work in general, as we use them to stage old artworks in new exhibitions.

Ilya: I will be showing some upcoming beta features, including ways to use Webrecorder to improve other archives…

Firstly, which other web archives? So I built a public web archives repository:

https://github.com/webrecorder/public-web-archives

And with this work we are using WAM – the Web Archiving Manifest. And we have added a WARC source URI and a WARC creation date field to the WARC header at the moment.

So, Jefferson already talked about patching – patching remote archives from the live web… This is an approach where we patch either from the live web or from other archives, depending on what is available or missing. So, for instance, if I look at a Washington Post page in the archive from 2nd March… It shows how other archives are being patched in to deliver me a page… In the collection I have a thing called “patch” that captures this.

Once pages are patched, then we introduce extraction… We are extracting again using remote archiving and automatic patching. So you combine extraction and patching features. You create two patches and two WARC files. I’ll demo that as well… So, here’s a page from the CCA website and we can patch that… And then extract that… And then when we patch again we get the images, the richer content, a much better recording of the page. So we have 2 WARCs here – one from the British Library archive, one from the patching that might be combined and used to enrich that partial UKWA capture.

Similarly we can look at a CNN page and take patches from e.g. the Portuguese archive. And once it is done we have a more complete archive… When we play this back you can display the page as it appeared, and patch files are available for archives to add to their copy.
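
Patching of this kind leans on Memento-style lookups; as a rough sketch, this is how one might ask the public Time Travel aggregator for the capture closest to a date (the endpoint is the public aggregator and the response keys are as I understand them, so treat this as an illustration of the idea rather than Webrecorder’s internals):

```python
import json
import urllib.request

# Ask the Memento Time Travel aggregator for captures of a URL around a
# given date (YYYYMMDDhhmmss).
target = "http://www.cnn.com/"
when = "20170302120000"
api = f"http://timetravel.mementoweb.org/api/json/{when}/{target}"

with urllib.request.urlopen(api) as resp:
    data = json.loads(resp.read().decode("utf-8"))

closest = data.get("mementos", {}).get("closest", {})
print("Closest capture:", closest.get("datetime"))
for uri in closest.get("uri", []):
    print("  ", uri)
```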

So, this is all in beta right now but we hope to release it all in the near future…

Q&A

Q1) Every web archive already has a temporal issue where the content may come from other dates than the page claims to have… But you could aggravate that problem. Have you considered this?

A1) Yes. There are timebounds for patching. And also around what you display to the user so they understand what they see… e.g. to patch only within the week or the month…

Q2) So it’s the closest date to what is in web recorder?

A2) The other sources are the closest successful result on/closest to the date from another site…

Q3) Rather than a fixed window for collection, seeing frequency of change might be useful to understand quality/relevance… But I think you are replaying…

A3) Have you considered a headless browser… with the address bar…

A3 – Lozana) Actually for us the key use case is about highlighting and showcasing old art works to the users. It is really important to show the original page as it appeared – in the older browsers like Netscape etc.

Q4) This is incredibly exciting. But how difficult is the patching… What does it change?

A4) If you take a good capture and a static image is missing… Those are easy to patch in… If highly contextualised – like Facebook, that is difficult to do.

Q5) Can you do this in realtime… So you archive with Perma.cc then you want to patch something immediately…

A5) This will be in the new version I hope… So you can check other sources and fall back to other sources and scenarios…

Comment – Lozana) We have run UX work with an archiving organisation in Europe for cultural heritage, and their use case is that they use Archive-It and do QA the next day… The crawl might miss something that is highly dynamic, so they want to be able to patch it pretty quickly.

Ilya) If you have an archive that is not in the public archive list on Github please do submit it as a fork request and we’ll be able to add it…

Leveraging APIs (Chair: Nicholas Taylor)

Fernando Melo and Joao Nobre: Arquivo.pt API: enabling automatic analytics over historical web data

Fernando: We are a publicly available web archive, mainly of Portuguese websites from the .pt domain. So, what can you do with our API?

Well, we built our first image search using our API, for instance a way to explore Charlie Hebdo materials; another application enables you to explore information on Portuguese politicians.

We support the Memento protocol, and you can use the Memento API. We are one of the time gates for the time travel searches. And we also have full text search as well as URL search, through our OpenSearch API. We have extended our API to support temporal searches in the Portuguese web. Find this at: http://arquivo.pt/apis/opensearch/. Full text search requests can be made through a URL query, e.g. http://arquivo.pt/opensearch?query=euro 2004 would search for mentions of euro 2004, and you can add parameters to this, or search as a phrase rather than keywords.

You can also search mime types – so just within PDFs for instance. And you can also run URL searches – e.g. all pages from the New York Times website… And if you provide time boundaries the search will look for the capture from the nearest date.
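
A minimal sketch of such a request against the OpenSearch endpoint mentioned above – only the query parameter is taken from the talk, so check the API documentation for anything beyond that:

```python
import urllib.parse
import urllib.request

# Full-text phrase search against the Arquivo.pt OpenSearch API. Date bounds,
# type filters and output options exist per the talk, but their parameter
# names are not shown here and would need checking against the docs.
params = {"query": '"euro 2004"'}   # quoted phrase rather than keywords
url = "http://arquivo.pt/opensearch?" + urllib.parse.urlencode(params)

with urllib.request.urlopen(url) as resp:
    print(resp.read()[:500])        # raw response; format depends on the API
```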

Joao: I am going to talk about our image search API. This works based on keyword searches, you can include operators such as limiting to images from a particular site, to particular dates… Results are ordered by relevance, recency, or by type. You can also run advanced image searches, such as for icons, you can use quotation marks for names, or a phrase.

The request parameters include:

  • query
  • stamp – timestamp
  • Start – first index of search
  • safe Image (yes; no; all) – restricts search only to safe images.

The response is returned in JSON with totalResults, URL, width, height, alt, score, timestamp, mime, thumbnail, nsfw and pageTitle fields.

More on all of this: http://arquivo.pt/apis
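
Assuming a JSON response shaped like the field list above, client code might look roughly like this – the envelope key ("responseItems") and sample values are invented for illustration:

```python
# A mocked-up response using the field names described above; the real API's
# envelope may differ, so check http://arquivo.pt/apis before relying on it.
sample = {
    "totalResults": 1,
    "responseItems": [
        {
            "url": "http://example.pt/cartoon.png",
            "width": 640,
            "height": 480,
            "alt": "front page cartoon",
            "score": 0.87,
            "timestamp": "20150107120000",
            "mime": "image/png",
            "thumbnail": "<base64-encoded thumbnail>",
            "nsfw": False,
            "pageTitle": "Example front page",
        }
    ],
}

for item in sample["responseItems"]:
    print(item["pageTitle"], item["url"], f'{item["width"]}x{item["height"]}')
```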

Q&A

Q1) How do you classify safe for work/not safe for work

A1 – Fernando) This is a closed beta version. Safe for work/nsfw is based on a classifier built around a training set from Yahoo. We are not about blocking things, but we want to be able to exclude shocking images if needed.

Q1) We have this same issue in the GifCities project – we have a manually curated training set to handle that.

Comment) Maybe you need to have more options for that measure to provide levels of filtering…

Q2) With that json response, why did you include title and alt text…

A2) We process the image and extract the URL and the image text… So we capture the image and the alt text, but we thought that perhaps the page title would be interesting, giving some sense of context. Maybe the text before/after would also be useful but that takes more time… We are trying to keep this working…

Q3) What is the thumbnail value?

A3) It is in base 64. But we can make that clearer in the next version…

Nicholas Taylor: Lots more LOCKSS for web archiving: boons from the LOCKSS software re-architecture

This is following on from the presentation myself and colleagues did at last year’s IIPC on APIs.

LOCKSS came about from a serials librarian and a computer scientist. They were thinking about emulating the best features of the system for preserving print journals, allowing libraries to conserve their traditional role as preserver. The LOCKSS boxes would sit in each library, collecting from publishers’ websites, providing redundancy, sharing with other libraries if and when that publication was no longer available.

18 years on this is a self-sustaining programme running out of Stanford, with 10s of networks and hundreds of partners. Lots of copies isn’t exclusive to LOCKSS, but it is the decentralised replication model that addresses the fact that long term bit integrity is hard to solve, and that more (correlated) copies don’t necessarily keep things safe and can make them vulnerable to hackers. So this model is community approved, published on, and well established.

Last year we started re-architecting the LOCKSS software so that it becomes a series of web services. Why do this? Well, to reduce support and operation costs – taking advantage of other software from the web and web archiving worlds; to de-silo components and enable external integration – we want components to find use in other systems, especially in web archiving; and we are preparing to evolve with the web, to adapt our technologies accordingly.

What that means is that LOCKSS systems will treat WARC as a storage abstraction, and more seamlessly handle processing layers, proxies, etc. We also already integrate Memento, but this will also let us engage with WASAPI – on which there will be more in the next talk.

We have built a service for bibliographic metadata extraction, for web harvest and file transfer content; we can map values in DOM tree to metadata fields; we can retrieve downloadable metadata from expected URL patterns; and parse RIS and XML by schema. That model shows our bias to bibliographic material.
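
To illustrate the “map values in the DOM tree to metadata fields” idea, here is a small sketch using common Highwire-style citation meta tags – a real LOCKSS plugin encodes per-platform rules, so the tag names and field map below are just an example:

```python
from html.parser import HTMLParser

# Map common bibliographic <meta> tags to simple field names. These are the
# widely used Highwire-style names, not any particular plugin's rules.
FIELD_MAP = {
    "citation_title": "title",
    "citation_author": "author",
    "citation_doi": "doi",
    "citation_publication_date": "date",
}

class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in FIELD_MAP and attrs.get("content"):
            self.fields.setdefault(FIELD_MAP[name], []).append(attrs["content"])

html = '<meta name="citation_title" content="An Example Article">'
parser = MetaExtractor()
parser.feed(html)
print(parser.fields)   # {'title': ['An Example Article']}
```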

We are also using plugins to make bibliographic objects and their metadata on many publishing platforms machine-intelligible. We mainly work with publishing/platform heuristics like Atypon, Digital Commons, HighWire, OJS and Silverchair. These vary so we have a framework for them.

The use cases for metadata extraction would include applying it to consistent subsets of content in larger corpora; curating PA materials within broader crawls; retrieving faculty publications online; or retrieving content from University CMSs. You can also undertake discovery via bibliographic metadata, with your institution’s OpenURL resolver.

As described in 2005 D-Lib paper by DSHR et al, we are looking at on-access format migration. For instance x-bitmap to GIF.

Probably the most important core preservation capability is the audit and repair protocol. Network nodes conduct polls to validate the integrity of distributed copies of data chunks. More nodes = more security – more nodes can be down; more copies can be corrupted… The nodes do not trust each other in this model and responses cannot be cached. And when copies do not match, the node audits and repairs.
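
A toy sketch of the nonce-salted hashing idea behind those polls – this is a conceptual illustration only, not the actual LOCKSS polling protocol:

```python
import hashlib
import secrets

def vote(nonce: bytes, content: bytes) -> str:
    """Each peer hashes the poller's nonce together with its own copy."""
    return hashlib.sha256(nonce + content).hexdigest()

# A fresh random nonce is issued for every poll, so answers cannot be
# cached or precomputed by a misbehaving node.
nonce = secrets.token_bytes(32)
my_copy = b"contents of an archival unit"
peer_copies = [b"contents of an archival unit",
               b"contents of an archival unit",
               b"corrupted copy"]

my_vote = vote(nonce, my_copy)
agree = sum(1 for c in peer_copies if vote(nonce, c) == my_vote)
disagree = len(peer_copies) - agree

# If the poll goes against us our copy is suspect and we request a repair;
# if most peers agree with us, the disagreeing copies are the suspect ones.
print(f"{agree} agree, {disagree} disagree")
```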

We think that functionality may be useful in other distributed digital preservation networks, in repository storage replication layers. And we would like to support varied back-ends including tape and cloud. We haven’t built those integrations yet…

To date our progress has addressed the WARC work. By end of 2017 we will have Docker-ised components, have a web harvest framework, polling and repair web service. By end of 2018 we will have IP address and Shibboleth access to OpenWayBack…

By all means follow and plug in. Most of our work is in a private repository, which then copies to GitHub. And we are moving more towards a community orientated software development approach, collaborating more, and exploring use of LOCKSS technologies in other contexts.

So, I want to end with some questions:

  • What potential do you see for LOCKSS technologies for web archiving, other use cases?
  • What standards or technologies could we use that we maybe haven’t considered
  • How could we help you to use LOCKSS technologies?
  • How would you like to see LOCKSS plug in more to the web archiving community?

Q&A

Q1) Will these work with existing LOCKSS software, and do we need to update our boxes?

A1) Yes, it is backwards compatible. And the new features are containerised so that does slightly change the requirements of the LOCKSS boxes but no changes needed for now.

Q2) Where do you store bibliographic metadata? Or is it in the WARC?

A2) It is separate from the WARC, in a database.

Q3) With the extraction of the metadata… We have some resources around translators that may be useful.

Q4 – David) Just one thing on your simplified example… For each node… They all have to calculate a new separate nonce… None of the answers are the same… They all have to do all the work… It’s actually a system where untrusted nodes are compared… And several nodes can’t gang up on the others… Each peer randomly decides on when to poll on things… There is no leader here…

Q5) Can you talk about format migration…

A5) It’s a capability already built into LOCKSS but we haven’t had to use it…

A5 – David) It’s done on the requests in http, which include acceptable formats… You can configure this thing so that if an acceptable format isn’t found, then you transform it to an acceptable format… (see the paper mentioned earlier). It is based on mime type.

Q6) We are trying to use LOCKSS as a generic archive crawler… Is that still how it will work…

A6) I’m not sure I have a definitive answer… LOCKSS will still be web harvesting-based. It will still be interesting to hear about approaches that are not web harvesting based.

A6 – David) Also interesting for CLOCKSS which are not using web harvesting…

A6) For the CLOCKSS and LOCKSS networks – the big networks – the web harvesting portfolio makes sense. But other networks with other content types, that is becoming more important.

Comment) We looked at doing transformation that is quite straightforward… We have used an API

Q7) Can you say more about the community project work?

A7) We have largely run LOCKSS as more of an in-house project, rather than a community project. We are trying to move it more in the direction of say, Blacklight, Hydra….etc. A culture change here but we see this as a benchmark of success for this re-architecting project… We are also in the process of hiring a partnerships manager and that person will focus more on creating documentation, doing developer outreach etc.

David: There is a (fragile) demo that has a lot of this… The goal is to continue that through the LAWS project, as a way to try this out… You can (cautiously) engage with that at demo.laws.lockss.org but it will be published to GitHub at some point.

Jefferson Bailey & Naomi Dushay: WASAPI data transfer APIs: specification, project update, and demonstration

Jefferson: I’ll give some background on the APIs. This is an IMLS funded project in the US looking at Systems Interoperability and Collaborative Development for Web Archives. Our goals are to:

  • build WARC and derivative dataset APIs (AIT and LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  • Seed and launch community modelled on characteristics of successful development and participation from communities ID’d by project
  • Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  • Technical architecture to support this.

So, we’ve already run WARC and digital preservation surveys. 15-20% of Archive-It users download and locally store their WARCs – for various reasons – that is small and hasn’t really moved, and that’s why data transfer was a core area. We are doing online webinars and demos. We ran a national symposium on API based interoperability and digital preservation and we have white papers to come from this.

Development wise we have created a general specification, a LOCKSS implementation, Archive-it implementation, Archive-it API documentation, testing and utility (in progress). All of this is on GitHub.

The WASAPI Archive-It Transfer API is written in Python, meets all gen-spec criteria, with swagger YAML in the repos. Authorisation uses the AIT Django framework (same as the web app), not defined in the general specification. We are using browser cookies or HTTP basic auth. We have a basic endpoint (in production) which returns all WARCs for that account; base/all results are paginated. In terms of query parameters you can use: filename; filetype; collection (ID); crawl (ID for AIT crawl job); etc.

So what do you get back? A JSON object has: pagination, count, request-url, includes-extra. You have fields including account (Archive-It ID); checksums; collection (Archive-It ID); crawl; crawl time; crawl start; filename; filetype; locations; size. And you can request these through simple HTTP queries.
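
A hedged sketch of what a client of this endpoint looks like – the path and parameter values below are my assumptions, so check the Archive-It WASAPI documentation on GitHub before relying on them:

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumed endpoint path for the Archive-It WASAPI transfer API.
base = "https://partner.archive-it.org/wasapi/v1/webdata"
params = {"collection": 1234, "filetype": "warc"}   # illustrative values
url = base + "?" + urllib.parse.urlencode(params)

# HTTP basic auth, as mentioned above (credentials are placeholders).
token = base64.b64encode(b"username:password").decode("ascii")
request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

with urllib.request.urlopen(request) as resp:
    page = json.loads(resp.read().decode("utf-8"))

print("Total files:", page.get("count"))
for f in page.get("files", []):
    # Each entry carries checksums, size and one or more locations from
    # which the WARC itself can be fetched.
    print(f.get("filename"), f.get("size"), f.get("locations", [None])[0])
```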

You can also submit jobs for generating derivative datasets. We use existing query language.

In terms of what is to come, this includes:

  1. Minor AIT API features
  2. Recipes and utilities (testers welcome)
  3. Community building research and report
  4. A few papers on WA APIs
  5. Ongoing surveys and research
  6. Other APIs in WASAPI (past and future)

So we need some way to bring together these APIs regularly. And also an idea of what other APIs we need to support, and how to prioritise that.

Naomi: I’m talking about the Stanford take on this… These are the steps Nicholas, as project owner, does to download WARC files from Archive-It at the moment… It is a 13-step process… And this grant-funded work focuses on simplifying the first six steps and making them more manageable and efficient. As a team we are really focused on not being dependent on bespoke software; things must be maintainable, with continuous integration set up, excellent test coverage, and automate-able. There is a team behind this work, and this was their first touching of any of this code – we had 3 neophytes working on this with much to learn.

We are lucky to be just down the corridor from LOCKSS. Our preferred language is Ruby but Java would work best for LOCKSS. So we leveraged LOCKSS engineering here.

The code is at: https://github.com/sul-dlss/wasapi-downloader/.

You only need Java to run the code. And all arguments are documented in Github. You can also view a video demo:

YouTube Preview Image

These videos are how we share our progress at the end of each Agile sprint.

In terms of work remaining we have various tweaks, pull requests, etc. to ensure it is production ready. One of the challenges so far has been about thinking through crawls and patches, and the context of the WARC.

Q&A

Q1) At Stanford are you working with the other WASAPI APIs, or just the downloads one.

A1) I hope the approach we are taking is a welcome one. We have a lot of projects taking place, but we are limited by available software engineering cycles for archives work.

Note that we do need a new readme on GitHub

Q2) Jefferson, you mentioned plans to expand the API, when will that be?

A2 – Jefferson) I think that it is pretty much done and stable for most of the rest of the year… WARCs do not have crawl IDs or start dates – hence adding crawl time.

Naomi: It was super useful that the team that built the downloader was separate from the team building the WASAPI endpoint, as that surfaced a lot of the assumptions, issues, etc.

David: We have a CLOCKSS implementation pretty much building on the Swagger. I need to fix our ID… But the goal is that you will be able to extract stuff from a LOCKSS box using WASAPI using URL or Solr text search. But timing wise, don’t hold your breath.

Jefferson: We’d also like others feedback and engagement with the generic specification – comments welcome on GitHub for instance.

Web archives platforms & infrastructure (Chair: Andrew Jackson)

Jack Cushman & Ilya Kreymer: Thinking like a hacker: security issues in web capture and playback

Jack: We want to talk about securing web archives, and how web archives can get themselves into trouble with security… We want to share what we’ve learnt, and what we are struggling with… So why should we care about security as web archives?

Ilya: Well, web archives are not just a collection of old pages… No, high fidelity web archives run untrusted software. And there is an assumption that a live site is “safe” so there is nothing to worry about… but that isn’t right either…

Jack: So, what could a page do that could damage an archive? Not just a virus or a hack… but more than that…

Ilya: Archiving local content… Well, a capture system could have privileged access – to local ports or network servers or local files. It is a real threat. And it could capture private resources into a public archive. So, mitigation: network filtering and sandboxing; don’t allow capture of local IP addresses…

Jack: Threat: hacking the headless browser. Modern captures may use PhantomJS or other browsers on the server, most browsers have known exploits. Mitigation: sandbox your VM

Ilya: Stealing user secrets during capture… Normal web flow… But you have other things open in the browser. Partial mitigation: rewriting – rewrite cookies to exact path only; rewrite JS to intercept cookie access. Mitigation: separate recording sessions – for webrecorder use separate recording sessions when recording credentialed content. Mitigation: Remote browser.

Jack: So assume we are running MyArchive.com… Threat: cross site scripting to steal archive login

Ilya: Well you can use a subdomain…

Jack: Cookies are separate?

Ilya: Not really… In IE10 the archive within the archive might steal the login cookie. In all browsers a site can wipe and replace cookies.

Mitigation: run web archive on a separate domain from everything else. Use iFrames to isolate web archive content. Load web archive app from app domain, load iFrame content from content domain. As Webrecorder and Perma.cc both do.

Jack: Now, in our content frame… how bad could it be if that content leaks… What if we have live web leakage on playback? This can happen all the time… It’s hard to stop that entirely… JavaScript can send messages back and fetch new content… to mislead, track users, rewrite history. Bonus: for private archives – any of your captures could export any of your other captures.

The best mitigation is a Content-Security-Policy header, which can limit access to the web archive domain.
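
As a concrete example of that mitigation, here is a minimal Flask-style sketch attaching a restrictive Content-Security-Policy to replayed content – the route and domain names are placeholders, not Webrecorder or Perma.cc code:

```python
from flask import Flask, Response

app = Flask(__name__)

@app.route("/replay/<path:url>")
def replay(url):
    body = "<html><body>archived content would go here</body></html>"
    resp = Response(body, mimetype="text/html")
    # Only allow the page to load resources from the content domain itself;
    # anything pointing back at the live web is blocked by the browser.
    resp.headers["Content-Security-Policy"] = (
        "default-src 'self' https://content.myarchive.example"
    )
    return resp

if __name__ == "__main__":
    app.run()
```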

Ilya: Threat: showing different page contents when archived… Pages can tell they’re in an archive and act differently. Mitigation: run the archive in a containerised/proxy mode browser.

Ilya: Threat: Banner spoofing… This is a dangerous but quite easy to execute threat. Pages can dynamically edit the archives banner…

Jack: Suppose I copy the code of a page that was captured and change fake evidence, change the metadata of the date collected, and/or the URL bar…

Ilya: You can’t do that in Perma because we use frames. But if you don’t separate banner and content, this is a fairly easy exploit to do… So, mitigation: use iFrames for replay; don’t inject the banner into the replay frame… It’s a fidelity/security trade-off…

Jack: That’s our top 7 tips… But what next… What we introduce today is a tool called http://warc.games. This is a version of webrecorder with every security problem possible turned on… You can run it locally on your machine to try all the exploits and think about mitigations and what to do about them!

And you can find some exploits to try, some challenges… Of course if you actually find a flaw in any real system please do be respectful

Q&A

Q1) How much is the bug bounty?! [laughs] What do we do about the use of very old browsers…

A1 – Jack) If you use an old browser you may be compromised already… But we use the most robust solution possible… In many cases there are secure options that work with older browsers too…

Q2) Any trends in exploits?

A2 – Jack) I recommend the book The Tangled Web… And there is an aspect that when you run a web browser there will always be some sort of issue…

A2 – Ilya) We have to get around security policies to archive the web… It wasn’t designed for archiving… But that raises its own issues.

Q3) Suggestions for browser makers to make these safer?

A3) Yes, but… How do you do this with current protocols and APIs

Q4) Does running old browsers and escaping from containers keep you awake at night…

A4 – Ilya) Yes!

A4 – Jack) If anyone is good at container escapes please do write that challenge as we’d like to have it in there…

Q5) There’s a great article called “Familiarity Breeds Contempt” which notes that old browsers and software get more vulnerable over time… It is a particularly big risk where you need old software to archive things…

A5 – Jack) Thanks David!

Q6) Can you say more about the headers being used…

A6) The idea is we write the CSP header to only serve from the archive server… And they can be quite complex… May want to add something of your own…

Q7) May depend on what you see as a security issue… for me it may be about the authenticity of the archive… By building something in the website that shows different content in the archive…

A7 – Jack) We definitely think that changing the archive is a security threat…

Q8) How can you check the archives and look for arbitrary hacks?

A8 – Ilya) It’s pretty hard to do…

A8 – Jack) But it would be a really great research question…

Mat Kelly & David Dias: A collaborative, secure, and private InterPlanetary WayBack web archiving system using IPFS

David: Welcome to the session on going InterPlanetary… We are going to talk about peer-to-peer and other technology to make web archiving better…

We'll talk about the InterPlanetary File System (IPFS) and InterPlanetary Wayback (IPWB)…

IPFS is also known as the distributed web, moving from location-based to content-based addressing… As we are aware, the web has some problems… You have experience of using a service, accessing email, using a document… There is some break in connectivity… And suddenly all those essential services are gone… Why? Why do we need to have the services working in such a vulnerable way… Even a simple page: you lose a connection and you get a 404. Why?

There is a real problem with permanence… We have this URI, the URL, telling us the protocol, location and content path… But we come back later – weeks or months – and that content has moved elsewhere… Either somewhere else you can find, or somewhere you can't. Sometimes it's like the content has been destroyed… Yet every time people view a webpage, they download it to their machine… These issues come from location addressing…

In content addressing we tie content to a unique hash that identifies the item… So a Content Identifier (CID) allows us to do this… And then, in a network, when I look for that data… If there is a disruption to the network, we can ask any machine where the content is… And the node near you can show you what is available before you ever go to the network.
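
A toy illustration of that shift (plain Python; a real IPFS CID is a multihash over chunked data rather than a bare SHA-256 hex digest, so treat this purely as a sketch of the idea):

import hashlib

# Toy content addressing: the identifier is derived from the bytes themselves,
# so any peer holding the same bytes can answer the request and the fetcher
# can verify what it received. (A real IPFS CID is a multihash over chunked
# data, not a bare SHA-256 hex digest.)
def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

page = b"<html><body>Hello, archived web!</body></html>"
cid = content_id(page)

# Location addressing: "fetch http://example.com/page.html and trust the host."
# Content addressing: "ask the network for <cid> and verify whatever arrives."
assert content_id(page) == cid  # any copy, from any peer, verifies identically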

IPFS is already used in video streaming (inc. Netflix), legal documents, 3D models – with HoloLens for instance – for games, for scientific data and papers, blogs and webpages, and totally distributed web apps.

IPFS allows this to be distributed and available offline, saves space, optimises bandwidth usage, etc.

Mat: So I am going to talk about IPWB. The motivation here is that the persistence of archived web data depends on the resilience of the organisation and the availability of the data. The design extends the CDXJ format, with an indexing and IPFS dissemination procedure, and a replay and IPFS pull procedure. So the adapted CDXJ adds the hash for the content to the metadata structure of each index entry.
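
Roughly, an adapted index line might look like the sketch below (the JSON field names and the Qm… values are placeholders, not necessarily IPWB's exact schema):

import json

# Sketch of an IPFS-aware CDXJ index line: a SURT-ordered URL key, a 14-digit
# capture timestamp, and a JSON block carrying the IPFS hash(es) needed to
# pull the capture back at replay time. Field names and Qm... values are
# placeholders, not necessarily IPWB's exact schema.
entry = {
    "locator": "urn:ipfs/QmHeaderHashPlaceholder/QmPayloadHashPlaceholder",
    "mime_type": "text/html",
    "status_code": "200",
}
surt_key = "com,example)/"        # canonicalised, SURT-ordered URL
timestamp = "20170614120000"      # capture time, YYYYMMDDhhmmss

cdxj_line = f"{surt_key} {timestamp} {json.dumps(entry)}"
print(cdxj_line)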

Dave: One of the ways IPFS is pushing at the boundary is in the browser tab, in a browser extension, and in a service worker acting as a proxy for requests the browser makes, with no changes to the interface (that one is definitely in alpha!)…

So IPWB can expose the content to IPFS, and users can then connect and do everything in the browser without needing to download and execute code on their machine. Building it into the browser makes it easy to use…

Mat: And IPWB enables privacy, collaboration and security, building the encryption method and key into the WARC. Similarly CDXJs may be transferred for our users' replay… Ideally you won't need a CDXJ on your own machine at all…

We are also rerouting, rather than rewriting, for archival replay… We’ll be presenting on that late this summer…

And I think we just have time for a short demo…

For more see: https://github.com/oduwsdl/ipwb

Q&A

Q1) Mat, I think that you should tell that story of what you do…

A1) So, I looked for files on another machine…

A1 – Dave) When Mat has the archive file on a remote machine… Someone looks for this hash on the network – "send it my way, I have it"… So when Mat looked, it replied… so the content was discovered… request issued, content received… and presented… And that also lets you capture pages appearing differently in different places and easily access them…

Q2) With the hash addressing, are there security concerns…

A2 – Dave) We use multihash, using SHA-2… But you can use different hash functions; they just verify the link… In IPFS we avoid that issue with self-describing hashes…
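
For context, the "self-describing" part is simply that a multihash prefixes the digest with the hash function code and digest length; a minimal sketch for sha2-256 (0x12 is its registered code, and the input is a placeholder):

import hashlib

# A multihash prefixes the digest with the hash function code and the digest
# length, which is what makes it "self-describing": the algorithm can change
# later without breaking existing identifiers. 0x12 is the registered code
# for sha2-256; the input below is just a placeholder.
def multihash_sha256(data: bytes) -> bytes:
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest  # <fn code><length><digest>

mh = multihash_sha256(b"example archived content")
assert mh[0] == 0x12 and mh[1] == 32  # readers know how to verify the rest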

Q3) The problem is that the hash function does end up in the URL… and it will decay over time because the hash function will decay… It's a really hard problem to solve – making a choice now that may be wrong… But there is no way of knowing the right choice now.

A3) At least we can use the hash function to indicate whether it looks likely to be the right or wrong link…

Q4) Is hashing itself useful with or without IPFS… Or is content addressing itself inherently useful?

A4 – Dave) I think the IPLD is useful anyway… So with legal documents where links have to stay intact, and not be part of the open web, then IPFS can work to restrict that access but still make this more useful…

Q5) If we had a content-addressable web, almost all these web archiving issues would be resolved really… It is hard to know if content is in Archive 1 or Archive 2. A content-addressable web would make it easier to archive… Important to keep in mind…

A5 – Dave) I 100% agree! A content-addressed web lets you understand what is important to capture. And IPFS saves a lot of bandwidth and a lot of storage…

Q6) What is the longevity of the hashes and how do I check that?

A6 – Dave) OK, you can check the integrity of the hash. And we have filecoin.io, which is a blockchain-based storage network and cryptocurrency, and that does handle this information… Using an address in a public blockchain… That's our solution for some of those specific problems.

Andrew Jackson (AJ), Jefferson Bailey (JB), Kristinn Sigurðsson (KS) & Nicholas Taylor (NT): IIPC Tools: autumn technical workshop planning discussion

AJ: I’ve been really impressed with what I’ve seen today. There is a lot of enthusiasm for open source and collaborative approaches and that has been clear today and the IIPC wants to encourage and support that.

Now, in September 2016 we had a hackathon but there were some who just wanted to get something concrete done… And we might therefore adjust the format… Perhaps pre-define a task well ahead of time… But also run a parallel track for the next hackathon/more experimental side. Is that a good idea? What else might work?

JB: We looked at Archives Unleashed, and we did a White House Social Media Hackathon earlier this year… This is a technical track but… it’s interesting to think about what kind of developer skills/what mix will work best… We have lots of web archiving engineers… They don’t use the software that comes out of it… We find it useful to have archivists in the room…

Then, from another angle, there's the issue that at the hackathons… IIPC doesn't have a lot of money and travel is expensive… The impact of that gets debated – it's a big budget line for 8-10 institutions out of 53 members. The outcomes are obviously useful but… Expecting people to be totally funded for days on end across the world isn't feasible… So maybe more little events, or fewer bigger events, could work…

Comment 1) Why aren’t these sessions recorded?

JB: Too much money. We have recorded some of them… Sometimes it happens, sometimes it doesn’t…

AJ: We don’t have in-house skills, so it’s third party… And that’s the issue…

JB: It’s a quality thing…

KS: But also, when we’ve done it before, it’s not heavily watched… And the value can feel questionable…

Comment 1) I have a camera at home!

JB: People can film whatever they want… But that’s on people to do… IIPC isn’t an enforcement agency… But we should make it clear that people can film them…

KS: For me… You guys are doing incredible things… And it’s things I can’t do at home. The other aspect is that… There are advancements that never quite happened… But I think there is value in the unconference side…

AJ: One of the things with unconference sessions is that…

NT: I didn't go to the London hackathon… Now we have a technical team, it's more appealing… The conference in general is good for surfacing issues we have in common… such as extraction of metadata… But there is also the question of when we sit down to deal with some specific task… That could be useful for taking things forward…

AJ: I like the idea of a counter-conference, focused on the tools… I was a bit concerned that if there were really specific things… What does it need to be to be worth your organisation flying you to it… Too narrow and it's exclusionary… Too broad and maybe it's not helpful enough…

Comment 2) Worth seeing the model used by Python – they have a sprint after their conference. That isn't an unconference but lets you come together. Mozilla Fest Sprint picks a topic and then next time you work on it… Sometimes other organisations with less money are worth looking at… And for things like crowdsourcing coverage etc… There must be models…

AJ: This is cool… You will have to push on this…

Comment 3) I think that tacking on to a conference helps…

KS: But it's challenging to be away from the office more than 3-4 days…

Comment 4) Maybe look at NodeJS Community and how they organise… They have a website, NodeSchool.io with three workshops… People organise events pretty much monthly… And create material in local communities… Less travel but builds momentum… And you can see that that has impact through local NodeJS events now…

AJ: That would be possible to support as well… with IIPC or organisational support… Bootstrapping approaches…

Comment 5) Other than hackathon there are other ways to engage developers in the community… So you can engage with Google Summer of Code for instance – as mentors… That is where students look for projects to work on…

JB: We have two GSoC students and like 8 working without funding at the moment… But it's non-trivial to manage that…

AJ: Onboarding new developers in any way would be useful…

Nick: Onboarding into the weird and wacky world of web archiving… If IIPC can curate a lot of onboarding stuff, that would be really good for potential… for getting started… Not relying on a small number of people…

AJ: We have to be careful as IIPC tools page is very popular, but hard to keep up to date… Benefits can be minor versus time…

Nick: Do you have GitHub? Just put up an awesome list!

AJ: That’s a good idea…

JB: Microfunding projects – sub-$10k – are also an option for cost-recovered bought-out time for some of these sorts of tasks… That would be really interesting…

Comment 6) To expand on what Jefferson and Nick were saying… I'm really new… I went to IIPC in April. I am enjoying this and learning a lot… I've been talking to a lot of you… That would really help more people get the technical environment right… Organisations want to get into archiving on a small scale…

Olga: We do have a list on GitHub… but it's not up to date or well used…

AJ: We do have this document, we have GitHub… But we could refer to each other… and point to the getting started stuff (only). I'd rather get away from lists…

Comment 7) Google has an OpenSource.guide page – could take inspiration from that… Licensing, communities, etc… Very simple plain English getting started guide/documentation…

Comment 8) I’m very new to the community… And I was wondering to what extent you use Slack and Twitter between events to maintain these conversations and connections?

AJ: We have a Slack channel, but we haven’t publicised it particularly but it’s there… And Twitter you should tweet @NetPreserve and they will retweet then this community will see that…

Apr 092017
 
Digital Footprint MOOC logo

Last Monday we launched the new Digital Footprint MOOC, a free three week online course (running on Coursera) led by myself and Louise Connelly (Royal (Dick) School of Veterinary Studies). The course builds upon our work on the Managing Your Digital Footprints research project, campaign and also draws on some of the work I’ve been doing in piloting a Digital Footprint training and consultancy service at EDINA.

It has been a really interesting and demanding process working with the University of Edinburgh MOOCs team to create this course, particularly focusing in on the most essential parts of our Digital Footprints work. Our intention for this MOOC is to provide an introduction to the issues and equip participants with appropriate skills and understanding to manage their own digital tracks and traces. Most of all we wanted to provide a space for reflection and for participants to think deeply about what their digital footprint means to them and how they want to manage it in the future. We don’t have a prescriptive stance – Louise and I manage our own digital footprints quite differently but both of us see huge value in public online presence – but we do think that understanding and considering your online presence and the meaning of the traces you leave behind online is an essential modern life skill and want to contribute something to that wider understanding and debate.

Since MOOCs – Massive Open Online Courses – are courses which people tend to take in their own time for pleasure and interest, but also as part of their CPD and personal development, the format seemed like a good fit for digital footprint skills and reflection, along with some of the theory and emerging trends from our research work. We also think the course has potential to be used in supporting digital literacy programmes and activities, and those looking for skills for transitioning into and out of education, and in developing their careers. On that note we were delighted to see the All Aboard: Digital Skills in Higher Education‘s 2017 event programme running last week – their website, created to support digital skills in Ireland, is a great complementary resource to our course, which we made a (small) contribution to during their development phase.

Over the last week it has been wonderful to see our participants engaging with the Digital Footprint course, sharing their reflections on the #DFMOOC hashtag, and really starting to think about what their digital footprint means for them. From the discussion so far the concept of the “Uncontainable Self” (Barbour & Marshall 2012) seems to have struck a particular chord for many of our participants, which is perhaps not surprising given the degree to which our digital tracks and traces can propagate through others’ posts, tags, listings, etc. whether or not we are sharing content ourselves.

When we were building the MOOC we were keen to reflect the fact that our own work sits in a context of, and benefits from, the work of many researchers and social media experts both in our own local context and the wider field. We were delighted to be able to include guest contributors including Karen Gregory (University of Edinburgh), Rachel Buchanan (University of Newcastle, Australia), Lilian Edwards (Strathclyde University), Ben Marder (University of Edinburgh), and David Brake (author of Sharing Our Lives Online).

The usefulness of making these connections across disciplines and across the wider debate on digital identity seems particularly pertinent given recent developments that emphasise how fast things are changing around us, and how our own agency in managing our digital footprints and digital identities is being challenged by policy, commercial and social factors. Those notable recent developments include…

On 28th March the US Government voted to remove restrictions on the sale of data by ISPs (Internet Service Providers), potentially allowing them to sell an incredibly rich picture of browsing, search, behavioural and intimate details without further consultation (you can read the full measure here). This came as the UK Government mooted the banning of encryption technologies – essential for private messaging, financial transactions, access management and authentication – claiming that terror threats justified such a wide ranging loss of privacy. Whilst that does not seem likely to come to fruition given the economic and practical implications of such a measure, we do already have the  Investigatory Powers Act 2016 in place which requires web and communications companies to retain full records of activity for 12 months and allows police and security forces significant powers to access and collect personal communications data and records in bulk.

On 30th March, a group of influential privacy researchers, including danah boyd and Kate Crawford, published Ten simple rules for responsible big data research in PLoSOne. The article/manifesto is an accessible and well argued guide to the core issues in responsible big data research. In many ways it summarises the core issues highlighted in the excellent (but much more academic and comprehensive) AoIR ethics guidance. The PLoSOne article is notably directed to academia as well as industry and government, since big data research is at least as much a part of commercial activity (particularly social media and data driven start ups, see e.g. Uber’s recent attention for profiling and manipulating drivers) as traditional academic research contexts. Whilst academic research does usually build ethical approval processes (albeit conducted with varying degrees of digital savvy) and peer review into research processes, industry is not typically structured in that way and often not held to the same standards, particularly around privacy and boundary crossing (see, e.g. Michael Zimmer’s work on both academic and commercial use of Facebook data).

The Ten simple rules… are also particularly timely given the current discussion of Cambridge Analytica and its role in the 2016 US Election, and the UK’s EU Referendum. An article published in Das Magazin in December 2016, and a subsequent English language version published on Vice’s Motherboard, have been widely circulated on social media over recent weeks. These articles suggest that the company’s large scale psychometrics analysis of social media data essentially handed victory to Trump and the Leave/Brexit campaigns, which naturally raises personal data and privacy concerns as well as influence, regulation and governance issues. There remains some skepticism about just how influential this work was… I tend to agree with Aleks Krotoski (social psychologist and host of BBC’s The Digital Human) who – speaking with Pat Kane at an Edinburgh Science Festival event last night on digital identity and authenticity – commented that she thought the Cambridge Analytica work was probably a mix of significant hyperbole but also some genuine impact.

These developments focus attention on access, use and reuse of personal data and personal tracks and traces, and that is something we hope our MOOC participants will have the opportunity to pause and reflect on as they think about what they leave behind online when they share, tag, delete, and particularly when they consider terms and conditions, privacy settings and how they curate what is available and to whom.

So, the Digital Footprint course is launched and open to anyone in the world to join for free (although Coursera will also prompt you with the – very optional – possibility of paying a small fee for a certificate), and we are just starting to get a sense of how our videos and content are being received. We’ll be sharing more highlights from the course, retweeting interesting comments, etc. throughout this run (which began on Monday 3rd April), but also future runs since this is an “on demand” MOOC which will run regularly every four weeks. If you do decide to take a look then I would love to hear your comments and feedback – join the conversation on #DFMOOC, or leave a comment here or email me.

And if you’d like to find out more about our digital footprint consultancy, or would be interested in working with the digital footprints research team on future work, do also get in touch. Although I’ve been working in this space for a while this whole area of privacy, identity and our social spaces seems to continue to grow in interest, relevance, and importance in our day to day (digital) lives.

 

Mar 142017
 

Today and tomorrow I’m in Birmingham for the Jisc Digifest 2017 (#digifest17). I’m based on the EDINA stand (stand 9, Hall 3) for much of the time, along with my colleague Andrew – do come and say hello to us – but will also be blogging any sessions I attend. The event is also being livetweeted by Jisc and some sessions livestreamed – do take a look at the event website for more details. As usual this blog is live and may include typos, errors, etc. Please do let me know if you have any corrections, questions or comments. 

Plenary and Welcome

Liam Earney is introducing us to the day, with the hope that we all take something away from the event – some inspiration, an idea, the potential to do new things. Over the past three Digifest events we’ve taken a broad view. This year we focus on technology expanding and enabling learning and teaching.

LE: So we will be talking about questions we asked through Twitter and through our conference app with our panel:

  • Sarah Davies (SD), head of change implementation support – education/student, Jisc
  • Liam Earney (LE), director of Jisc Collections
  • Andy McGregor (AM), deputy chief innovation officer, Jisc
  • Paul McKean (PM), head of further education and skills, Jisc

Q1: Do you think that greater use of data and analytics will improve teaching, learning and the student experience?

  • Yes 72%
  • No 10%
  • Don’t Know 18%

AM: I’m relieved at that result as we think it will be important too. But that is backed up by evidence emerging in the US and Australia around data analytics use in retention and attainment. There is a much bigger debate around AI and robots, and around Learning Analytics there is that debate about how human and data, and human and machine, can work together. We have several sessions in that space.

SD: Learning Analytics has been around its own hype cycle already… We had huge headlines about the potential about a year ago, but now we are seeing much more in-depth discussion, discussion around making sure that our decisions are data-informed… There is concern around the role of the human here, but the tutors, the staff, are the people who access this data and work with students, so it is about human and data together, and that’s why adoption is taking a while as they work out how best to do that.

Q2: How important is organisational culture in the successful adoption of education technology?

  • Total make or break 55%
  • Can significantly speed it up or slow it down 45%
  • It can help but not essential 0%
  • Not important 0%

PM: Where we see education technology adopted we do often see that organisational culture can drive technology adoption. An open culture – for instance Reading College’s open door policy around technology – can really produce innovation and creative adoption, as people share experience and ideas.

SD: It can also be about what is recognised and rewarded. About making sure that technology is more than what the innovators do – it’s something for the whole organisation. It’s not something that you can do in small pockets. It’s often about small actions – sharing across disciplines, across role groups, about how technology can make a real difference for staff and for students.

Q3: How important is good quality content in delivering an effective blended learning experience?

  • Very important 75%
  • It matters 24%
  • Neither 1%
  • It doesn’t really matter 0%
  • It is not an issue at all 0%

LE: That’s reassuring, but I guess we have to talk about what good quality content is…

SD: I think materials – good quality primary materials – make a huge difference, there are so many materials we simply wouldn’t have had (any) access to 20 years ago. But also about good online texts and how they can change things.

LE: My colleague Karen Colbon and I have been doing some work on making more effective use of technologies… Paul you have been involved in FELTAG…

PM: With FELTAG I was pleased when that came out 3 years ago, but I think only now we’ve moved from the myth of 10% online being blended learning… And moving towards a proper debate about what blended learning is, what is relevant not just what is described. And the need for good quality support to enable that.

LE: What’s the role for Jisc there?

PM: I think it’s about bringing the community together, about focusing on the learner and their experience, rather than the content, to ensure that overall the learner gets what they need.

SD: It’s also about supporting people to design effective curricula too. There are sessions here, talking through interesting things people are doing.

AM: There is a lot of room for innovation around the content. If you are walking around the stands there is a group of students from UCL who are finding innovative ways to visualise research, and we’ll be hearing pitches later with some fantastic ideas.

Q4: Billions of dollars are being invested in edtech startups. What impact do you think this will have on teaching and learning in universities and colleges?

  • No impact at all 1%
  • It may result in a few tools we can use 69%
  • We will come to rely on these companies in our learning and teaching 21%
  • It will completely transform learning and teaching 9%

AM: I am towards the 9% here, there are risks but there is huge reason for optimism here. There are some great companies coming out and working with them increases the chance that this investment will benefit the sector. Startups are keen to work with universities, to collaborate. They are really keen to work with us.

LE: It is difficult for universities to take that punt, to take that risk on new ideas. Procurement, governance, are all essential to facilitating that engagement.

AM: I think so. But I think if we don’t engage then we do risk these companies coming in and building businesses that don’t take account of our needs.

LE: Now that’s a big spend taking place for that small potential change that many who answered this question perceive…

PM: I think there are savings that will come out of those changes potentially…

AM: And in fact that potentially means saving money on tools we currently use by adopting new ones, and investing that into staff…

Q5: Where do you think the biggest benefits of technology are felt in education?

  • Enabling or enhancing learning and teaching activities 55%
  • In the broader student experience 30%
  • In administrative efficiencies 9%
  • It’s hard to identify clear benefits 6%

SD: I think many of the big benefits we’ve seen over the last 8 years have been around things like online timetables – wider student experience and administrative spaces. But we are also seeing that, when used effectively, technology can really enhance the learning experience. We have a few sessions here around that. Key here is the digital capabilities of staff and students – whether awareness, confidence, or understanding of fit with disciplinary practice. Lots here at Digifest around digital skills. [sidenote: see also our new Digital Footprint MOOC which is now live for registrations]

I’m quite surprised that 6% thought it was hard to identify clear benefits… There are still lots of questions there, and we have a session on evidence based practice tomorrow, and how evidence feeds into institutional decision making.

PM: There is something here around the Apprentice Levy which is about to come into place. A surprisingly high percentage of employers aren’t aware that they will be paying that actually! Technology has a really important role here for teaching, learning and assessment, but also tracking and monitoring around apprenticeships.

LE: So, with that, I encourage you to look around, chat to our exhibitors, craft the programme that is right for you. And to kick that off here is some of the brilliant work you have been up to. [we are watching a video – this should be shared on today’s hashtag #digifest17]
And with that, our session ended. For the next few hours I will mainly be on our stand but also sitting in on Martin Hamilton’s session “Loving the alien: robots and AI in education” – look out for a few tweets from me and many more from the official live tweeter for the session, @estherbarrett.

Plenary and keynote from Geoff Mulgan, chief executive and CEO, Nesta (host: Paul Feldman, chief executive, Jisc)

Paul Feldman: Welcome to Digifest 2017, and to our Stakeholder Meeting attendees who are joining us for this event. I am delighted to welcome Geoff Mulgan, chief executive of Nesta.

Geoff: Thank you all for being here. I work at Nesta. We are an investor in quite a few ed tech companies, we run a lot of experiments in schools and universities… And I want to share with you two frustrations. The whole area of ed tech is, I think, one of the most exciting, perhaps ever! But the whole field is frustrating… And in Britain we have phenomenal tech companies, and phenomenal universities high in the rankings… But too rarely we bring these together, and we don’t see that vision from ministers either.

So, I’m going to talk about the promise – some of the things that are emerging and developing. I’ll talk about some of the pitfalls – some of the things that are going wrong. And some of the possibilities of where things could go.

So, first of all, the promise. We are going through yet another wave – or series of waves – of Google, Watson, DeepMind, Fitbits, sensors… We are at least 50 years into the “digital revolution” and yet the pace of change isn’t letting up – Moore’s Law still applies. So, finding the applications is as exciting and challenging as ever.

Last year DeepMind defeated a champion at Go. People thought that it was impossible for a machine to win at Go, because of the intuition involved. That cutting edge technology is now being used in London with blood test data to predict who may be admitted to hospital in the next year.

We have also seen these free online bitesize platforms – Coursera, Udacity, etc. – these challenges to traditional courses. And we have Google Translate in November 2016 adopting a neural machine translation engine that can translate whole sentences… Google Translate may be a little clunky still but we are moving toward that Hitchhiker’s Guide to the Galaxy idea of the Babel fish. In January 2017 a machine-learning powered poker bot outcompeted 20 of the world’s best. We are seeing more of these events… The Go contest was observed by 280 million people!

Much of this technology is feeding into this emerging Ed Tech market. There are MOOCs, there are learning analytics tools, there is a huge range of technologies. The UK does well here… When you talk about education you have to talk about technology, not just bricks and mortar. This is a golden age but there are also some things not going as they should be…

So, the pitfalls. There is a lack of understanding of what works. NESTA did a review 3 years ago of school technologies and that was quite negative in terms of return on investment. And the OECD similarly compared spend with learning outcomes and found a negative correlation. One of the odd things about this market is that it has invested very little in using control groups, and gathering the evidence.

And where is the learning about learning? When the first MOOCs appeared I thought it was extraordinary that they showed little interest in decades of knowledge and understanding about elearning, distance learning, online learning. They just shared materials. It’s not just the cognitive elements, you need peers, you need someone to talk to. There is a common finding over decades that you need that combination of peer and social elements and content – that’s one of the reasons I like FutureLearn, as it combines that more directly.

The other thing that is missing is the business models. Few ed tech companies make money… They haven’t looked at who will pay, how much they should pay… And I think that reflects, to an extent, the world view of computer scientists…

And I think that business model wise some of the possibilities are quite alarming. Right now many of the digital tools we use are based on collecting our data – the advertiser is the customer, you are the product. And I think some of our ed tech providers, having failed to raise income from students, are somewhat moving in that direction. We are also seeing household data, the internet of things, and my guess is that the impact of these will raise much more awareness of privacy, security, use of data.

The other thing is jobs and future jobs. Some of you will have seen these analyses of jobs and the impact of computerisation. Looking over the last 15 years we’ve seen big shifts here… Technical and professional knowledge has been relatively well protected. But there is also a study (Frey, C and Osborne, M 2013) that looks at those at low risk of computerisation and automation – dentists are safe! – and those at high risk which includes estate agents, accountants, but also actors and performers. We see huge change here. In the US one of the most common jobs in some areas is truck driver – they are at high risk here.

We are doing work with Pearson to look at job market requirements – this will be published in a few months time – to help educators prepare students for this world. The jobs likely to grow are around creativity, social intelligence, also dexterity – walking over uneven ground, fine manual skills. If you combine those skills with deep knowledge of technology, or specialised fields, you should be well placed. But we don’t see schools and universities shaping their curricula to these types of needs. Is there a conscious effort to look ahead and to think about what 16-22 year olds should be doing now to be well placed in the future?

In terms of more positive possibilities… Some of those I see coming into view… One of these is Skills Route, which was launched for teenagers. It’s an open data set which generates a data-driven guide for teenagers about which subjects to study, allowing teenagers to see what jobs they might get, what income they might attract, how happy they will be even, depending on their subject choices. These insights will be driven by data, including understanding of what jobs may be there in 10 years time. Students may have a better idea of what they need than many of their teachers, their lecturers etc.

We are also seeing a growth of adaptive learning. We are an investor in CogBooks which is a great example. This is a game changer in terms of how education happens. The way AI is built it makes it easier for students to have materials adapt to their needs, to their styles.

My colleagues are working with big cities in England, including Birmingham, to establish Offices of Data Analytics (and data marketplaces), which can enable understanding of e.g. buildings at risk of fire that can be mitigated before fire fighting is needed. I think there are, again, huge opportunities for education. Get into conversations with cities and towns, to use the data commons – which we have but aren’t (yet) using to the full extent of its potential.

We are doing a project called Arloesiadur in Wales which is turning big data into policy action. This allowed policy makers in Welsh Government to have a rich real time picture of what is taking place in the economy, including network analyses of investors, researchers, to help understand emerging fields, targets for new investment and support. This turns the hit and miss craft skill of investment into something more accurate, more data driven. Indeed work on the complexity of the economy shows that economic complexity maps to higher average annual earnings. This goes against some of the smart cities expectation – which wants to create more homogenous environments. Instead diversity and complexity is beneficial.

We host at NESTA the “Alliance for Useful Evidence” which includes a network of around 200 people trying to ensure evidence is used and useful. Out of that we have a series of “What Works” centres – NICE (health and care); the Education Endowment Foundation; the Early Intervention Foundation; the Centre for Ageing Better; the College of Policing (crime reduction); the Centre for Local Economic Growth; What Works Wellbeing… But bizarrely we don’t have one of these for education and universities. These centres help organisations to understand where evidence for particular approaches exists.

To try and fill the gap a bit for universities we’ve worked internationally with the Innovation Growth Lab to understand investment in research, what works properly. This is applying scientific methods to areas on the boundaries of university. In many ways our current environment does very little of that.

The other side of this is the issue of creativity. In China the principal of one university felt it wasn’t enough for students to be strong in engineering, they needed to solve problems. So we worked with them to create programmes for students to create new work, addressing problems and questions without existing answers. There are comparable programmes elsewhere – students facing challenges and problems, not starting with the knowledge. It’s part of the solution… But some work like this can work really well. At Harvard students are working with local authorities and there is a lot of creative collaboration across ages, experience, approaches. In the UK there isn’t any university doing this at serious scale, and I think this community can have a role here…

So, what to lobby for? I’ve worked a lot with government – we’ve worked with about 40 governments across the world – and I’ve seen vice chancellors and principals who have access to government and they usually lobby for something that looks like the present – small changes. I have never seen them lobby for substantial change, for more connection with industry, for investment and ambition at the very top. The leaders argue for the needs of the past, not the present. That isn’t true in other industries – they look ahead, and make that central to their case. I think that’s part of why we don’t see this coming together in an act of ambition like we saw in the 1960s when the Open University was founded.

So, to end…

Google’s Tilt Brush is one of the most interesting things to emerge in the last few years – a 3D virtual world that allows you to paint with a virtual brush. It is exciting as no-one knows how to do this. It’s exciting because it is uncharted territory. It will be, I think, a powerful learning tool. It’s a way to experiment and learn…

But the other side of the coin… The British public’s favourite painting is The Fighting Temeraire… An ugly steamboat pulls in a beautiful old sailing boat to be smashed up. It is about technological change… But also about why change is hard. The old boat is more beautiful, tied up with woodwork and carpentry skills, culture, songs… There is a real poetry… But its message is that if we don’t go through that, we don’t create space for the new. We are too attached to the old models to let them go – especially the leaders who came through those old models. We need to create those Google Tilt Brushes, but we also have to create space for the new to breathe as well.

Q&A

Q1 – Amber Thomas, Warwick) Thinking about the use of technology in universities… There is research on technology in education and I think you point to a disconnect between the big challenges from research councils and how research is disseminated, a disconnect between policy and practice, and a lack of availability of information to practitioners. But also I wanted to say that BECTA used to have some of that role for experimentation and that went in the “bonfire of the quangos”. And what should Jisc’s role be here?

A1) There is all of this research taking place but it is often not used. That emphasis on “Useful Evidence” is important. Academics are not always good at this… What will enable a busy head teacher, a busy tutor, to actually understand and use that evidence? There are some spaces for education at schools level but there is a gap for universities. BECTA was a loss. There is a lack of Ed Tech strategy. There is real potential. To give an example… We have been working with finance, forcing banks to open up data, with banks required by the regulator to fund creative use of that data to help small firms understand their finance. That’s a very different role for the regulator… But I’d like to see institutions willing to do more of that.

A1 – PF) And I would say we are quietly activist.

Q2) To go back to the Hitchhikers Guide issue… Are we too timid in universities?

A2) There is a really interesting history of radical universities – some with no lectures, some with no walls, in Paris a short-lived experiment handing out degrees to strangers on buses! Some were totally student driven. My feeling is that that won’t work; it’s like music – you need some structure, some grammars… I like challenge-driven universities as they aren’t *that* groundbreaking… You have some structure and content, you have interdisciplinary teams, you have assessment there… It is a space for experimentation. You need some systematic experimentation on the boundaries… Some creative laboratories on the edge to inform the centre, with some of that quite radical. And I think that we lack those… Things like the Coventry SONAR (?) course for photography which allowed input from the outside, a totally open course including discussion and community… But those sorts of experiments tend not to be in a structure… And I’d like to see systematic experimentation.

Q3 – David White, UAL) When you put up your ed tech slide, a lot of students wouldn’t recognise that as they use lots of free tools – Google etc. Maybe your old warship is actually the market…

A3) That’s a really difficult question. In any institution of any sense, students will make use of the cornucopia of free things – Google Hangouts and YouTube. That’s probably why the Ed Tech industry struggles so much – people are used to free things. Google isn’t free – you indirectly pay through the sale of your data, as with Facebook. Wikipedia is free but philanthropically funded. I don’t know if that model of Google etc. can continue as we become more aware of data and data use concerns. We don’t know where the future is going… We’ve just started a new project with Barcelona and Amsterdam around the idea of the Data Commons, which doesn’t depend on sale of data to advertisers etc., but that faces the issue of who will pay. My guess is that the free data-based model may last up to 10 years, but then something will change…

How can technology help us meet the needs of a wider range of learners?

Pleasing Most of the People Most of the Time – Julia Taylor, subject specialist (accessibility and inclusion), Jisc.

I want to tell you a story about buying LEGO for a young child… My kids loved LEGO and it’s changed a lot since then… I bought a child this pack with lots of little LEGO people with lots of little hats… And this child just sort of left all the people on the carpet because they wanted the LEGO people to choose their own hats and toys… And that was disappointing… And I use that example because there is an important role in helping individuals find the right tools. The ultimate goal of digital skills and inclusion is about giving people the skills and confidence to use the appropriate tools. The idea is that the tools magically turn into tools…

We’ve never had more tools for giving people independence… But what is the potential of technology, and how can it be selected and used? We’ll hear more about delivery and use of technology in this context. But I want to talk about what technology is capable of delivering…

Technology gives us the tools for digital diversity, allowing the student to be independent in how they access and engage with our content. That kind of collaboration can also be as meaningful in an international context as it is for learners who have to fit studies around, say, shift work. It allows learners to do things the way they want to do them. That idea of independent study through digital technology is really important. So these tools afford digital skills, the tools remove barriers and/or enable students to overcome them. Technology allows learners with different needs to overcome challenges – perhaps of physical disability, perhaps remote location, perhaps learners with little free time. Technology can help people take those small steps to start or continue their education. It’s as much about that as those big global conversations.

It is also the case that technology can be a real motivator and attraction for some students. And the technology can be about overcoming a small step, to deal with potential intimidation at new technology, through to much more radical forms that keep people engaged… So when you have tools aimed at the larger end of the scale, you also enable people at the smaller end of the scale. Students do have expectations, and some are involved in technology as a lifestyle, as a lifeline, that supports their independence… They are using apps and tools to run their life. That is the direction of travel with people, and with young people. Technology is an embedded part of their life. And we should work with that, perhaps even encourage them to use more technology, to depend on it more. Many of us in this room won’t have met a young visually impaired person who doesn’t have an iPhone, as those devices allow them to read, to engage, to access their learning materials. Technology is a lifeline here. That’s one example, but there are others… Autistic students may be using an app like “Brain in Hand” to help them engage with travel, with people, with education. We should encourage this use, and we do encourage this use of technology.

We encourage learners to check if they can:

  • Personalise and customise the learning environment
  • Get text books in alternative formats – that they can adapt and adjust as they need
  • Find out about the access features of loan devices and platforms – and there are features built into devices and platforms you use and require students to use. How much do you know about the accessibility of learning platforms that you buy into.
  • Get accessible course notes in advance of lectures – notes that can be navigated and adapted easily, taking away unnecessary barriers. Ensuring documents are accessible for the maximum number of people.
  • Use productivity tools and personal devices everywhere – many people respond well to text to speech, it’s useful for visually impaired students, but also for dyslexic students too.

Now we encourage organisations to make their work accessible to the widest range of people possible. For instance a free and available text-to-speech tool provides technology that we know works for some learners, for a wide range of learners. That helps those with real needs, but will also benefit other learners, including some who would never disclose a challenge or disability.

So, when you think about technology, think about how you can reach the widest possible range of learners. This should be part of course design, staff development… All areas should include accessible and inclusive technologies.

And I want you now to think about the people and infrastructure required and involved in these types of decisions…  So I have some examples here about change…

What would you need to do to enable a change in practice like this learner statement:

“Usually I hate fieldwork. I’m disorganised, make illegible notes, can’t make sense of the data because we’ve only got little bits of the picture until the evening write up…” 

This student isn’t benefitting from the fieldwork until the information is all brought together. The teacher dealt with this by combining data, information, etc. on the learner’s phone, including QR codes to help them learn… That had an impact and the student continues:

“But this was easy – Google forms. Twitter hashtags. Everything on the phone. To check a technique we scanned the QR code to watch videos. I felt like a proper biologist… not just a rubbish notetaker.”

In another example a student who didn’t want to speak in a group and was able to use a Text Wall to enable their participation in a way that worked for them.

In another case a student who didn’t want to blog but it was compulsory in their course. But then the student discovered they could use voice recognition in GoogleDocs and how to do podcasts and link them in… That option was available to everyone.

Comment: We are a sixth form college. We have a student who is severely dyslexic and he really struggled with classwork. Using voice recognition software has been transformative for that student and now they are achieving the grades and achievements they should have been.

So, what is needed to make this stuff happen? How can we make it easy for change to be made… Is inclusion part of your student induction? It’s hard to gauge from the room how much of this is embedded in your organisations. You need to think about how far down the road you are, and what else needs to be done so that the majority of learners can access podcasts, productivity tools, etc.

[And with that we are moving to discussion.]

It’s great to hear you all talking and I thought it might be useful to finish by asking you to share some of the good things that are taking place…

Comment: We have an accessibility unit – a central unit – and that unit provides workshops on technologies for all of the institution, and we promote those heavily in all student inductions. Also I wanted to say that note taking sometimes is the skill that students need…

JT: I was thinking someone would say that! But I wanted to make the point that we should be providing these tools and communicating that they are available… There are things we can do but it requires us to understand what technology can do to lower the barrier, and to engage staff properly. Everyone needs to be able to use and promote technology for use…

The marker by which we are all judged is the success of our students. Technology must be inclusive for that to work.

You can find more resources here:

  • Chat at Todaysmeet.com/DF1734
  • Jisc A&I Offer: TinyURL.com/hw28e42
  • Survey: TinyURL.com/jd8tb5q

How can technology help us meet the needs of a wider range of learners? – Mike Sharples, Institute of Educational Technology, The Open University / FutureLearn

I wanted to start with the idea of accessibility and inclusion. As you may already know the Open University was established in the 1970s to open up university to a wider range of learners… In 1970 19% of our students hadn’t been to university before, now it’s 90%. We’re rather pleased with that! As a diverse and inclusive university, accessibility and inclusivity are essential. As we move towards more interactive courses, we have to work hard to make fieldtrips accessible to people who are not mobile, to ensure all of our astronomy students have access to telescopes, etc.

So, how do we do this? The learning has to be future-oriented, and suited to what learners will need in the future. I like the idea of the kinds of jobs you see on Careers 2030 – Organic Voltaics Engineer, Data Wrangler, Robot Counsellor – the kinds of work roles that may be there in the future. At the same time as looking to the future we need to also think about what it means to be in a “post-truth era” – with accessibility of materials, and access to the educational process too. We need a global open education.

So, FutureLearn is a separate but wholly owned company of the Open University. There are 5.6 million learners, 400 free courses. We have 70 partner institutions, with 70% of learners from outside the UK, 61% are female, and 22% have had no other tertiary education.

When we came to build FutureLearn we had a pretty blank slate. We had EdX and similar but they weren’t based on any particular pedagogy – built around extending the lectures, and around personalised quizzes etc. And as we set up FutureLearn we wanted to encourage a social constructivist model, and the idea of “Learning as Conversation”, based on the idea that all learning is based on conversation – with ourselves, with our teachers and their expertise, and with other learners to try and reach shared understanding. And that’s the brief our software engineers took on. We wanted it to be scalable, for every piece of content to have conversation around it – so that rather than sending you to forums, the conversation sat with the content. And also the idea of peer review, of study groups, etc.

So, for example, the University of Auckland have a course on Logical and Critical thinking. Linked to a video introducing the course is a conversation, and that conversation includes facilitative mentors… And engagement there is throughout the conversation… Our participants have a huge range of backgrounds and locations and that’s part of the conversation you are joining.

Now 2012 was the year of the MOOC, but now they are becoming embedded, and MOOCs need to be taken seriously as part of campus activities, as part of blended learning. In 2009 the US DoE undertook a major meta-study of comparisons of online and face-to-face teaching in higher education. On average, students in online learning conditions performed better than those receiving face-to-face teaching alone, but those undertaking a blend of campus and online did best.

So, we are starting to blend campus and online, with campus students accessing MOOCs, with projects and activities that follow up MOOCs, and we now have the idea of hybrid courses. For example FutureLearn has just offered its full postgraduate course with Deakin University. MOOCs are no longer far away from campus learning; they are blending together in new ways of accessing content and accessing conversation. And it’s the flexibility of study that is so important here. There are also new modes of learning (e.g. flipped learning), as well as global access to higher education, including free courses, global conversation and knowledge sharing. The idea of credit transfer and a broader curriculum enabled by that. And the concept of disaggregation – affordable education, pay for use? At the OU only about a third of our students use the tutoring they are entitled to, so perhaps only those that use tutoring should pay.

As Geoff Mulgan said, we do lack evidence – though that is happening. But we also really need new learning platforms that will support free as well as accredited courses, that enable accreditation, credit transfer, badging, etc.

Q&A

Q1) How do you ensure the quality of the content on your platform?

A1) There are a couple of ways… One was in our selective choice of which universities (and other organisations) we work with. So that offers some credibility and assurance. The other way is through the content team who advise every partner, every course, who creates content for FutureLearn. And there are quite a few quality standards – quite a lot of people on FutureLearn came from the BBC and they come with a very clear idea of quality – there is diversity of the offer but the quality is good.

Q2) What percentage of FutureLearn learners “complete” the course?

A2) In general it’s about 15-20%. Those 15% or so have opportunities they wouldn’t otherwise have had. We’ve also done research on who drops out and why… Most (95%) say “it’s not you, it’s me”. Some of those are personal and quite emotional reasons. But mainly life has just gotten in the way and they want to return. Of those remaining 5% about half felt the course wasn’t at quite the right level for them, the other half just didn’t enjoy the platform, it wasn’t right for them.

So, now over to you to discuss…

  1. What pedagogy, ways of doing teaching and learning, would you bring in?
  2. What evidence? What would constitute success in terms of teaching and learning?

[Discussion]

Comments: MOOCs are quite different from modules and programmes of study… Perhaps there is a branching off… More freestyle learning… The learner gets value from whatever paths they go through…

Comments: SLICCs at Edinburgh enable students to design their own module, reflecting and graded against core criteria, but in a project of their own shaping. [read more here]

Comments: Adaptive learning can be a solution to that freestyle learning process… That allows branching off, the algorithm to learn from the learners… There is also the possibility to break a course down to smallest components and build on that.

I want to focus a moment on technology… Is there something that we need?

Comments: We ran a survey of our students about technologies… Overwhelmingly our students wanted their course materials available, they weren’t that excited by e.g. social media.

Let me tell you a bit about what we do at the Open University… We run lots of courses, each looks difference, and we have a great idea of retention, student satisfaction, exam scores. We find that overwhelmingly students like content – video, text and a little bit of interactivity. But students are retained more if they engage in collaborative learning. In terms of student outcomes… The lowest outcomes are for courses that are content heavy… There is a big mismatch between what students like and what they do best with.

Comment: There is some research on learning games that also shows satisfaction at the time doesn’t always map to attainment… Stretching our students is effective, but it’s uncomfortable.

Julia Taylor: Please do get in touch if you have more feedback or comments on this.

Dec 052016
 
Image credit: Brian Slater

This is a very wee blog post/aside to share the video of my TEDxYouth@Manchester talk, “What do your digital footprints say about you?”:

You can read more on the whole experience of being part of this event in my blog post from late November.

It would appear that my first TEDx, much like my first Bright Club, was rather short and sweet (safely within my potential 14 minutes). I hope you enjoy it and I would recommend catching up with my fellow speakers’ talks:

Kat Arney [embedded video]

Ben Smith [embedded video]

VV Brown [embedded video]

Ben Garrod [embedded video]

I gather that the videos of the incredible teenage speakers and performers will follow soon.

 

Dec 042016
 

This summer I will be co-chairing, with Stefania Manca (from The Institute of Educational Technology of the National Research Council of Italy), “Social Media in Education”, a Mini Track of the European Conference on Social Media (#ECSM17) in Vilnius, Lithuania. As the call for papers has been out for a while (deadline for abstracts: 12th December 2016) I wanted to remind and encourage you to consider submitting to the conference and, particularly, to our Mini Track, which we hope will highlight exciting social media and education research.

You can download the Mini Track Call for Papers on Social Media in Education here. And, from the website, here is the summary of what we are looking for:

An expanding amount of social media content is generated every day, yet organisations are facing increasing difficulties in both collecting and analysing the content related to their operations. This mini track on Big Social Data Analytics aims to explore the models, methods and tools that help organisations in gaining actionable insight from social media content and turning that to business or other value. The mini track also welcomes papers addressing the Big Social Data Analytics challenges, such as, security, privacy and ethical issues related to social media content. The mini track is an important part of ECSM 2017 dealing with all aspects of social media and big data analytics.

Topics of the mini track include but are not limited to:

  • Reflective and conceptual studies of social media for teaching and scholarly purposes in higher education.
  • Innovative experience or research around social media and the future university.
  • Issues of social media identity and engagement in higher education, e.g. digital footprints of staff, students or organisations; professional and scholarly communications; and engagement with academia and wider audiences.
  • Social media as a facilitator of changing relationships between formal and informal learning in higher education.
  • The role of hidden media and backchannels (e.g. SnapChat and YikYak) in teaching and learning.
  • Social media and the student experience.

The conference, the 4th European Conference on Social Media (ECSM), will be taking place at the Business and Media School of Mykolas Romeris University (MRU) in Vilnius, Lithuania on 3-4 July 2017. Having seen the presentation on the city and venue at this year’s event I feel confident it will be a lovely setting and should be a really good conference. (I also hear Vilnius has exceptional internet connectivity, which is always useful.)

I would also encourage anyone working in social media to consider applying for the Social Media in Practice Excellence Awards, which ECSM is hosting this year. The competition will be showcasing innovative social media applications in business and the public sector, and the organisers are particularly looking for ways in which academia has been working with business around social media. You can read more – and apply to the competition (deadline for entries: 17th January 2017) – here.

This is a really interdisciplinary conference with a real range of speakers and topics so a great place to showcase interesting applications of and research into social media. The papers presented at the conference are published in the conference proceedings, widely indexed, and will also be considered for publication in: Online Information Review (Emerald Insight, ISSN: 1468-4527); International Journal of Social Media and Interactive Learning Environments (Inderscience, ISSN 2050-3962); International Journal of Web-Based Communities (Inderscience); Journal of Information, Communication and Ethics in Society (Emerald Insight, ISSN 1477-996X).

So, get applying to the conference and/or to the competition! If you have any questions or comments about the Social Media in Education track, do let me know.

Nov 212016
 
The band Cassia play at TEDxYouth@Manchester 2016.

Last Wednesday, I had the absolute pleasure of being part of TEDxYouth@Manchester 2016, which had the theme of “Identity”. I had been invited along to speak about our Managing Your Digital Footprint work, and my #CODI2016 Fringe show, If I Googled You, What Would I Find? The event was quite extraordinary and I wanted to share some thoughts on the day itself, as well as some reflections on my experience of preparing a TEDx talk.

TEDxYouth@Manchester is in its 8th year, and is based at Fallibroome Academy, a secondary school with a specialism in performing arts (see, for instance, their elaborate and impressive trailer video for the school). And Fallibroome was apparently the first school in the world to host a TEDxYouth event. Like other TEDx events the schedule mixes invited talks, talks from youth speakers, and recorded items – in this case that included a TED talk, a range of short films, music videos and a quite amazing set of videos of primary school kids responding to questions on identity (beautifully edited by the Fallibroome team and featuring children from schools in the area).

In my own talk – the second of the day – I asked the audience to consider the question of what their digital footprints say about them, and what they want them to say about them. My intention was to trigger reflection and thought, to make the audience in the room – and on the livestream – think about what they share, what they share about others and, hopefully, what else they do online – their privacy settings, their choices…

My fellow invited speakers were a lovely and diverse bunch:

Kat Arney, a geneticist, science writer, musician, and author, was there to talk about identity from a genetic perspective, drawing on her fantastic new book “Herding Hemingway’s Cats” (my bedtime reading this week). Kat’s main message – a really important one – is that genes don’t predetermine your identity, and that any understanding of there being a “Gene for… x”, e.g. the “Gene for Cancer”, a “Gay Gene”, a gene for whatever… is misleading at best. Things are much more complicated and unpredictable than that. As part of her talk she spoke about gene “wobbles” – a new concept to me – which describes the unexpected and rule-defying behaviour of genes in the real world versus our expectations based on the theory, drawing on work on nematode worms. It was a really interesting start to the day and I highly recommend checking out both Kat’s book and The Naked Scientists’ Naked Genetics podcast.

Ben Smith spoke about his own very personal story and how that led to the 401 Challenge, in which he ran 401 marathons in 401 days. Ben spoke brilliantly and bravely about his experience of bullying, of struggling with his sexuality, and the personal crises and suicide attempts that led to him finding his own sense of self and identity, and happiness, through his passion for running in his late 20s/early 30s. Ben’s talk was even more powerful as it was preceded by an extraordinary video (see below) of the poem “To This Day” by performance poet Shane Koyczan, on the impact of bullying and the strength in overcoming it.

VV Brown, singer, songwriter, producer and ethical fashion entrepreneur, gave a lovely presentation on identity and black hair. She gave a personal and serious take on issues of identity and appropriation which have been explored (from another angle) in Chris Rock’s Good Hair (2009), as well as the rich culture of black hairdressing and the hugely problematic nature of hair relaxants, weaves, and hair care regimes (including some extreme acids) that pressure black women to meet an unobtainable and undesirable white hair ideal. She also spoke from her experience of the modelling industry and its inability to deal with black hair, whilst simultaneously happily engaging in cultural appropriation, braiding corn rows into white celebrities’ hair. VV followed up her talk with a live performance of “Shift” (see video below), a song which she explained was inspired by the gay rights movement, and particularly black gay men in New York expressing themselves and their sexuality.

The final invited speaker was Ben Garrod, a Teaching Fellow in evolutionary biology at Anglia Ruskin University as well as a science communicator and broadcaster who has worked with David Attenborough and is on the Board of Trustees of the Jane Goodall Institute. Ben spoke about the power of the individual in a community, bringing in the idea of identity amongst animals and the uniqueness of the chimps he worked with as part of Jane Goodall’s team. He also had us all join in a pant-hoot – an escalating group chimp call – to illustrate the power of both the individual and the community.

In amongst the speakers were a range of videos – lovely selections that I gather (and believe) a student team spent months selecting from a huge amount of TED content. However, the main strand of the programme was a group of student presentations and performances which were quite extraordinary.

Highlights for me included Imogen Walsh, who spoke about the fluidity of gender and explained the importance of choice, the many forms of non-binary or genderqueer identity, the use of pronouns like they and titles like Mx, and the importance of not singling people out, or questioning them, for buying non gender-conforming items, their choice of bathroom, etc. Because, well, why is it anyone else’s business?

Sophie Baxter talked about being a gay teen witnessing the global response to the Pulse nightclub shooting, and the fear and reassurance that the wider public response had provided. She also highlighted the importance of having an LGBT community since, for most LGBT young people, their own immediate biological/adoptive family may not, no matter how supportive, have a shared experience to draw upon to understand the challenges or concerns they face.

Maddie Travers and Nina Holland-Jones described a visit to Auschwitz (they had actually landed back the night before the event), reflecting on what the experience of visiting the site had meant to them, and what it said about identity. They particularly focused on the pain and horror of stripping away individual identity, treating camp prisoners (and victims) as a group in a way that denied their individuality, at the same time as privileging some individuals for special skills and contributions that extended their lives and made them useful to the Nazi regime.

Sam Amey, Nicola Smith and Ellena Wilson talked about attending the London International Youth Science Festival student science conference, of seeing inspiring new science and the excitement of that – watching as a real geek and science fan, it was lovely to see their enthusiasm and to hear them state that they “identify as scientists” (that phrasing was a recurrent theme and seems to be the 2016 way for young people to define themselves, I think).

Meanwhile performances included an absolutely haunting violin piece, Nigun by Bloch, performed by Ewan Kilpatrick (see a video of his playing here). As brilliant as Ewan’s playing was, musically the show was stolen by two precocious young composers, both of whom had the confidence of successful 40 year olds at the peak of their careers, backed up by musical skills that made that confidence seem entirely well founded. Ignacio Mana Mesas described his composition process and showed some of his film score (and acting) work, before playing a piece of his own composition; Tammas Slater (you can hear his prize winning work in this BBC Radio 3 clip) meanwhile showed some unexpected comic sparkle, showing off his skills before creating a composition in real time! And the event finished with a lively and charming set of tracks performed by school alumni and up-and-coming band Cassia.

All of the youth contributions were incredible. The enthusiasm, competence and confidence of these kids – and of their peers who respectfully engaged and listened throughout the day – was heartening. The future seems pretty safe if this is what it looks like – a very lovely thing to be reminded of in these strange political times.

Preparing a TEDx talk – a rather different speaking proposition

For me the invitation to give a TEDx talk was really exciting. I have mixed feelings about the brilliantly engaging but often too slick TED format, at the same time as recognising the power that the brand, and its reputation for high quality speakers, can have.

I regularly give talks and presentations, but distilling ideas of digital identity into 14 minutes whilst keeping them clear, engaging, and within the speaker rules felt challenging. Doing that in a way that would have some sort of longevity seemed like a tougher ask, as things move quickly in internet research, in social media, and in social practices online, so I wanted to make sure my talk focused on those aspects of our work that are solid and long-lived concepts – ideas that would have usefulness even if Facebook disappeared tomorrow (who knows, fake news may just make that a possibility), or SnapChat immediately lost all interest, or some new game-changing space appeared tomorrow. This issue of being timely but not immediately out of date is also something we face in creating Digital Footprint MOOC content at the moment.

As an intellectual challenge, developing my TEDx talk was useful for finding another way to think about my own presentation and writing skills, in much the same way that taking on the 8 minute format of Bright Club has been, or the 50-ish minute format of the Cabaret of Dangerous Ideas, or indeed teaching 2+ hour seminars for the MSc in Science Communication & Public Engagement for the three years I led a module on that programme. It is always useful to rethink your topic, to think about fitting a totally different dynamic or house style, and to imagine a different audience and their needs and interests. In this case the audience was 16-18 year olds, who are a little younger than my usual audience, but who I felt sure would have lots of interest in my topic, and plenty of questions to ask (as there were in the separate panel event later in the day at Fallibroome).

There are some particular curiosities about the TED/TEDx format versus other speaking and presentations and I thought I’d share some key things I spent time thinking about. You never know, if you find yourself invited to do a TEDx (or if you are very high flying, a TED) these should help a wee bit:

  1. Managing the format

Because I have mixed feelings about the TED format, since it can be brilliant but also too easy to parody (as in this brilliant faux talk), I was very aware of wanting to live up to the invitation and the expectations for this event whilst still giving a talk that fitted my own personal speaking style and presentation tastes. I think I did manage that in the end, but it required some watching of former videos to get my head around what I both did and did not want to do. That included looking back at previous TEDxYouth@Manchester events (to get a sense of space, scale, speaker set up and local expectations), as well as wider TED videos.

I did read the TED/TEDx speaker guidance and largely followed it although, since I do a lot of talks and know what works for me, I chose to write and create slides in parallel, with the visuals helping me develop my story (rather than writing first, then doing slides as the guidance suggests). I also didn’t practice my talk nearly as often as either the TED instructions or the local organisers suggest – not out of arrogance but knowing that practising a few times to myself works well, whereas practising a lot gets me bored of the content and sets up unhelpful memorisation of errors, developing ideas, etc.

I do hugely appreciate that TED/TEDx insist on copyright cleared images. My slides were mostly images I had taken myself, but I found a lovely image of yarn under CC-BY on Flickr which was included (and credited) too. As I began work on the talk I did start by thinking hard about whether or not to use slides at all… TED is a format associated with innovative slides (they were the original cheerleaders for Prezi), but at the same time the fact that talks are videoed means much of the power comes from close ups of the speaker, from capturing the connection between speaker and the live audience, and from building connection with the livestream and video audience. With all of that in mind I wanted to keep my slides simple, lively, and rather stylish. I think I managed that but see what you think of my slides [PDF].

  2. Which audience?

Normally when I write a talk, presentation, workshop, etc. I think about tailoring the content to the context and to my audience. I find that is a key part of ensuring I meet my audience’s needs, but it also makes the talk look, well, kind of cute and clever. Tailoring a talk for a particular moment in time, a specific event or day, and a particular audience means you can make timely and specific references, you can connect to talks and content elsewhere in the day, you can adapt and adlib to meet the interests and mood that you see, and you can show you have understood the context of your audience. Essentially all that tailoring helps you connect more immediately and builds a real bond.

But for TEDx, is the audience the 500+ people in the room? Our audience on Wednesday were mainly between 16 and 18, but there were other audience members who had been invited or just signed up to attend (you can find all upcoming TEDx events on their website and most offer tickets for those that are interested). It was a packed venue, but they are probably the smallest audience who will see my performance…

The video captured during the event goes on the TEDxYouth@Manchester 2016 Playlist on the TEDxYouth YouTube channel and on the TEDx YouTube channel. All of the videos are also submitted to TED so, if your video looks great to the folk there, you could also end up featured on the core TED website, with much wider visibility. Now, I certainly wouldn’t suggest I am counting on having a huge global audience, but those channels all attract a much wider audience than was sitting in the hall. So, where do you pitch the talk?

For my talk I decided to strike a balance between issues that are most pertinent to developing identity, and to managing challenges that we know from our research are particularly relevant and difficult for young people – and which these students may face now or when they go to university. But I also pitched the talk to have relevance more widely, focusing less on cyber bullying or teen dynamics, and more on changing contexts and the control one can choose to take of one’s own digital footprint and social media content, something particularly pertinent to young people but relevant to us all.

  3. When is it for?

Just as streaming distorts your sense of audience, it also challenges your sense of time. The livestream audience is watching on the day – that’s easy. But the recorded video could stick around for years, and will have a lifespan long beyond the day. With my fast moving area that was a challenge – do I make my talk timely or do I make it general? What points of connection and moments of humour are potentially missed by giving the talk a longer lifespan? I was giving a talk just after Trump’s election and in the midst of the social media bubble discussion – there are easy jokes there, things to bring my audience on board – but they might distance viewers at another time, and date rapidly. And maybe those references wouldn’t be universal enough for a wider audience beyond the UK…

In the end I tried to again balance general and specific advice. But I did that knowing that many of those in the physical audience would also be attending a separate panel event later in the day which would allow many more opportunities to talk about very contemporary questions, and to address sensitive questions that might (and did) arise. In fact in that panel session we took questions on mental health, about how parental postings and video (including some of those made for this event) might impact on their child’s digital footprint, and on whether not being on social media was a disadvantage in life. Those at the panel session also weren’t being streamed or captured in any way, which allowed for frank discussion building on an intense and complex day.

  4. What’s the main take away?

The thing that took me the longest was thinking about the “take away” I wanted to leave the audience with. That was partly because I wanted my talk to have impact, to feel energising and hopefully somewhat inspiring, but also because the whole idea of TED is “ideas worth spreading”, which means a TED(x) talk has to have at its core a real idea, something specific and memorable to take from those 14 minutes, something that has impact.

I did have to think of a title far in advance of the event and settled on “What do your digital footprints say about you?”. I picked that as it brought together some of my #CODI16 show’s ideas, and some of the questions I knew I wanted to raise in my talk. But what would I do with that idea? I could have taken the Digital Footprint thing in a more specific direction – something I might do in a longer workshop or training session – picking on particularly poor or good practices and zeroing in on good or bad posts. But that isn’t big picture stuff. I had to think about analogy, about examples, about getting the audience to understand the longevity of impact a social media post might have…

After a lot of thinking, and testing out ideas in conversation with my partner and some of my colleagues, I had some vague concepts, and then I found my best ideas came – contrary to the TED guidance – from trying to select images to help me form my narrative. An image I had taken at Edinburgh’s Hidden Door Festival earlier this year, of an artwork created from a web of strung yarn, proved the perfect visual analogy for the complexity involved in taking back an unintended, regretted, or ill-thought-through social media post. It’s an idea I have explained before, but actually trying to get the idea across in one minute of my 14 minute talk really helped me identify that image as a vivid, effective shorthand. And from that I found my preceding image and, from that, the flow and the look and feel of the story I wanted to tell. It’s not always the obvious (or simple) things that get you to a place of simplicity and clarity.

Finally I went back to my title and thought about whether my talk did speak to that idea, what else I should raise, and how I would really get my audience to feel engaged, ready to listen, and ready to reflect on their own practice, quickly. In the end I settled on a single slide with that title, that question, at its heart. I made that the first stepping stone on my path through the talk, building in a pause that was intended to get the audience listening and thinking about their own digital identity. You’d have to ask the audience whether that worked or not, but the quality of questions and comments later in the day certainly suggested they had taken in some of what I said and asked.

  5. Logistics

As a speaker there are some logistical aspects that are easy to deal with once you’ve done it a first time: travel, accommodation, etc. There are venue details that you can either ask about – filming, photography, mics, etc. – or find out in advance. Looking at previous years’ videos helped a lot: I would get a screen behind me for slides, there would be a set (built by students no less) and a clear speaker zone on stage (the infamous red carpet/dot), I’d have a head mic (a first for me, but essentially a glamorous radio mic, which I am used to) and there would be a remote for my slides. It also looked likely I’d have a clock counting down although, in the end, that wasn’t working during my talk (a reminder, again, that I need a new watch with a classic stand up comedy/speaker-friendly vibrating alarm). On the day there was a sound check (very helpful) and also an extremely professional and exceptionally helpful team of technicians – staff, students and Siemens interns – to get us wired up and recorded. The organisers also gave us plenty of advance notice of filming and photography.

I have been on the periphery of TEDx events before: Edinburgh University has held several events and I know how much work has gone into those; I attended a TEDxGlasgow hosted by STV a few years back and, again, was struck by the organisation required. For TEDxYouth@Manchester I was invited to speak earlier in the year – late August/early September – so I had several months to prepare. The organisers tell me that sometimes they invite speakers as much as 6 to 12 months ahead of the event – as soon as the event finishes their team begin their search for next year’s invitees…

As the organising team spend all year planning a slick event – and Fallibroome Academy really did do an incredibly well organised and slick job – they expect slick and well organised speakers. I think all of us invited speakers, each with a lot of experience of talks and performance, experienced more coordination, more contact and more clarity on expectations, format, etc. than at any previous speaking event.

That level of detail is always useful as a speaker, but it can also be intimidating – although that is useful for focusing your thoughts too. There were conference calls in September and October to share developing presentation thoughts, to finalise titles, and to hear a little about each other’s talks. That last aspect was very helpful – I knew little of the detail of the other talks until the event itself, but I had a broad idea of the topic and angle of each speaker, which meant I could ensure minimal overlap, and maximum impact, as I understood how my talk fitted in to the wider context.

All credit to Peter Rubery and the Fallibroome team for their work here. They curated a brilliant selection of videos and some phenomenal live performances and short talks from students, creating a coherent programme with appropriate and clever segues that added to the power of the presentations and talks, and took us on something of a powerful emotional rollercoaster. All of us invited speakers felt it was a speaking engagement like we’d never had before, and it really was an intense and impactful day. And, as Ben G said, for some students the talks they gave that day will be life changing: sharing something very personal on a pretty high profile stage, owning their personal experience and reflections in a really empowering way.

In conclusion then, this was really a wonderful experience and a usefully challenging format to work in. I will update this post or add a new post with the videos of the talks as soon as they are available – you can then judge for yourself how I did. However, if you get the chance to take part in a TEDx event, particularly a TEDxYouth event I would recommend it. I would also encourage you to keep an eye on the TEDxYouth@Manchester YouTube channel for those exceptional student presentations!