# TV Archives cracked Open "AI for IA"
## Artificial Intelligence for Internet Archive
### MozFest, London Oct 2017
by [traceypooh](https://twitter.com/tracey_pooh)

https://traceypooh.github.io/mozfest17

_?_ for key shortcuts

```bash
git clone https://github.com/traceypooh/mozfest17
open mozfest17/index.html
```

---
# Gist
_decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings_

---
# Intro to archive.org
- Wayback Machine
  - past copies of 300B+ pages
- 15M books, lendable
- ~4M videos, ~4M audio & live concerts
- 3M images
- 200K software items & emulation (in JS!)

---
# Library!
- Absolute browser Privacy
  - no personal data or IP addresses extracted
- Validation & nontampering
  - keep original versions with 2+ checksums and logs

```xml
commute h.264 commute.avi 1325973601 11919082 ff17ed66e7db5693dd208dd6ac488ff8 ad1df03a e9f9de8379cd25653d487ab30d198fc61a050091 115.61 480 640
```

---
## External Blockchain of Proofs
### of file mod times / checksums
- OpenTimestamps
  - uses SHA-1 and Merkle trees
  - by Peter Todd
  - blog
- _brand new!_

---
# archive.org/tv
- recording 50-100 channels
  - 24 x 7
  - around the world
  - since 2000
- _2 million+ news shows_
  - search captions/metadata
  - new Trump Administration and Congress subsets
  - citable reference clips
  - Popcorn editing/mashup clips
  - for AI experiments

---
# Artificial Intelligence
- _text_:
  - chyron ("lower third") scanning OCR (Third Eye)
  - caption alignment
  - OCR captions from DVB-S
    - BBC News
  - speech to text (VoiceBase)
    - Al Jazeera English
    - Deutsche Welle English
- _image_:
  - public officials facial detection
    (Faceomatic <-- Matroid <-- FaceNet)

---
# Artificial Intelligence
- _audio_:
  - fingerprinting
    - audfprint
    - free/open like Shazam
  - political Ad tracking
    - Duplitron 5000

---
# Public Feeds
- Twitter bots & TSV
  - Third Eye
- Slack bot
  - Faceomatic
- continuous captions feed from CSPAN
  - https://openedcaptions.com
  - https://pietropassarelli.gitbooks.io/textav/projects/opened-captions-service.html

---
- OCR 'lower third'
  - chyrons
  - overlaid text on broadcasts
  - not captions or descriptive text
  - editorial / summarizing in nature
- 4 TV channels, 24x7, ~1 min from realtime
  - CNN
  - MSNBC
  - Fox News
  - BBC News

---
  AFTER WH MEETING, SCHUMER DISHES
  WHEN HE THOUGHT NIC WAS OFF
  

---
# bots
- Twitter bots
  - https://twitter.com/tvThirdEye
  - https://twitter.com/tvThirdEyeB
  - https://twitter.com/tvThirdEyeF
  - https://twitter.com/tvThirdEyeM
  - https://twitter.com/tvThirdEye/lists/all

---
# API
- Tab Separated Values
  - https://archive.org/services/third-eye.php
  - nice for command-line
  - import to Google and Excel spreadsheets
- filtered
- raw (~25MB / day)
  - more errors
  - 3rd-party filtering possible
- TSV files uploaded to https://archive.org/details/third-eye

---
# Chyron filtering
- tesseract OCR
  - free; errors
- simhash
  - groups 'nearly the same'
    - character flips
    - word off in time
- look for vowels
- pick 'most seen' group every minute
  - and tweet

---
# TV AI Examples
- Vox determined that Fox News paid little attention to Puerto Rico
  - https://vox.com/2017/10/2/16401614/fox-news-puerto-rico-charts
- audio fingerprints
  - presented keynote paper on CSPAN floor speeches and vocal pitch (Bryce Dietrich, UIowa)
  - discovered 375K political Ads
  - find sound bites of speeches

---
# clips
- little JSON annotations
- associate metadata to program start/end time range
- auto expands each clip to a "synthetic" document
  - to elastic search
- JSONPatch for changes
- track play counts, some referers
- allows for _decentralized_ annotations to other IA / research

---
# clip

```json
{
  "268.1|269.1": {
    "subject": [
      "Criminal Activity",
      "Crime"
    ],
    "factcheck": [
      "http://www.factcheck.org/2016/07/factchecking-trumps-big-speech/"
    ]
  },
  "266.7|267.2": {
    "ad_id": "PolAd_DonaldTrump_d9dsn",
    "type": "campaign",
    "race": "PRES",
    "cycle": "2016",
    "message": "pro",
    "sponsor": [
      "Republican National Cmte"
    ],
    "sponsor_type": "PAC",
    "subject": [
      "Job Accomplishments"
    ],
    "person": [
      "Donald Trump"
    ]
  },
  "268.1|269.1": {
    "collection": [
      "nancy_pelosi_archive"
    ],
    "subject": [
      "Voting"
    ]
  }
}
```

---
# Where We're Going
- https://archive.org/details/TVNewsKitchen
- want to serve journalists, researchers, librarians & more
- responsible behavior and access to data
- non-consumptive use

---
## [Part 2] "There Goes 2 Weeks"
## deep dive into Image Matching and Facial Recognition

An imposter does not have Imposter Syndrome

---
# CNNs
- Convolutional Neural Network
- filtered neural network
- each layer uses output from prior layer as input
- instead of rule-based learning, use classified datasets to learn
- multi-node connections (but not "fully connected")
- "data squashers"

---
# CNN Example
- feed in image
- node looking for eyelash
- node looking for iris
  - could feed to node looking for eye
- meanwhile... nose node
- all feed to face recognizer node
  - could feed to "is this Barack Obama?"

---
# Guru Rik Heijdens from jwplayer
- Demuxed 2017 talk
- feed in video
- for each _shot_, make 3 vectors:
  - _image_ Inception CNN (tensorflow)
  - _audio_ CNN spectrogram
  - _text_ transcripts/STT into Word2Vec
- concat vectors, compare (cosine similarity), and graph
- ... yields _scene detection_
- all just for ideal Ad insertion!

---
# Image Matching
- pixel diff algorithms (MAE, RMSE, MSE)
- perceptual hashing pHash.org
  - image => _8x8 grayscale_
  - convolve to 8x8 image with DCT
  - reduce to _64bit_ number
  - hamming distance Int64 pairs

---
### pHash - to gray 8x8
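
---
### pHash - sketch

A minimal sketch of the perceptual-hash steps from the Image Matching slide (grayscale, DCT, 64-bit number, Hamming distance). It is not the pHash.org code. It assumes the image has already been decoded and shrunk to a 32x32 grayscale grid of plain numbers, keeps the 8x8 low-frequency corner of the DCT, and thresholds at the median, which is one commonly described variant. The names `dct_2d`, `phash` and `hamming` are hypothetical.

```python
import math


def dct_2d(block):
    """Unnormalized 2D DCT-II of a square grid (list of lists of numbers)."""
    n = len(block)
    # cos_table[u][x] is the DCT-II basis function u sampled at position x.
    cos_table = [
        [math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
        for u in range(n)
    ]
    # Transform every row, then every column of the row-transformed grid.
    rows = [
        [sum(row[x] * cos_table[u][x] for x in range(n)) for u in range(n)]
        for row in block
    ]
    return [
        [sum(rows[y][u] * cos_table[v][y] for y in range(n)) for u in range(n)]
        for v in range(n)
    ]


def phash(gray, hash_size=8):
    """Return a (hash_size * hash_size)-bit integer hash of a grayscale grid."""
    n = len(gray)
    if n < hash_size or any(len(row) != n for row in gray):
        raise ValueError("expected a square grid at least hash_size wide")
    coeffs = dct_2d(gray)
    # Keep only the low-frequency top-left corner of the DCT.
    low = [coeffs[v][u] for v in range(hash_size) for u in range(hash_size)]
    # Median of everything except the DC term, which only tracks brightness.
    rest = sorted(low[1:])
    median = rest[len(rest) // 2]
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")
```

Two frames are treated as "the same image" when `hamming(phash(a), phash(b))` falls under a threshold chosen by experiment; the threshold is not given in this deck.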



---
# TensorFlow & Training
- https://www.tensorflow.org/tutorials/image_recognition
  - trained CNNs, locally run
  - GoogLeNet Inception general classifier
- retrainable / customizable
  - redo 'top layer' (Rik idea)
  - https://www.tensorflow.org/tutorials/image_retraining
- 2048-element vectors (multi-byte floats)
  - iOS: smaller single-byte vectors
- cosine distance comparisons
  - can just compare vectors (and ignore readable classification labels) (Rik idea)

---
# OpenFace
- implementation of FaceNet
- https://cmusatyalab.github.io/openface/demo-3-classifier
- similar to tensorflow (Torch..)

---
# OpenFace Training
- 3+ images per person/face
- avoid 'overfit'
- align eyes + nose (nostrils?)
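
---
# Comparing vectors - sketch

The "cosine distance comparisons" bullet on the TensorFlow & Training slide amounts to comparing feature vectors directly and ignoring the classifier's labels. Below is a minimal sketch in plain Python. The function names and the dictionary of labeled vectors are hypothetical, and real Inception vectors would have 2048 elements rather than the handful used when trying this out.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """0 for vectors pointing the same way, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


def nearest(query, labeled):
    """Return the key in `labeled` whose vector is closest to `query`."""
    return min(labeled, key=lambda name: cosine_distance(query, labeled[name]))
```

`nearest(vector, known_vectors)` then picks the best match from a small set of reference vectors without retraining anything.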


---
# Siamese "one shot" CNN recognizers
- Rik idea
- _differentiate_ instead of _classify_
- learns similarity of 2 inputs
- repo / py notebook

---
# AI Ethics
- face tracking only public figures
- https://www.itic.org/resources/AI-Policy-Principles-FullReport2.pdf
  - min. government regulation & access
  - public/private partner; diversity/inclusion++
  - preserve human dignity, rights, freedoms
  - min. risk to humans; human control
  - large datasets -- avoid harmful bias
  - open discussion

---
# Demo Time
- Siamese network
- miniARchive
- tensorflow
- google translate

---
# help Shape US with YOUR Thoughts
- extend/shape our APIs
- AI ideas
- research, visualizations
- tag clips with AI metadata or pointers to Decentralized metadata
- more!

---
# Ergo
_decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings_

---
# The End