Clustering the Unknown - The Youtube Case

Published on Feb 18, 2019
· DOI: 10.1109/ICCNC.2019.8685364
Amit Dvir (Ariel University), estimated H-index: 9
Angelos K. Marnerides (Lancaster University), estimated H-index: 9
+ 1 more author
Nehor Golan (Ariel University), estimated H-index: 1
Recent stringent end-user security and privacy requirements have caused a dramatic rise in encrypted video streams, among which encrypted YouTube traffic is one of the most prevalent. Despite the encryption, metadata derived from such traffic flows can be used to identify the title of a video, which enables classifying a video stream to a single title from a given set of video titles. Nonetheless, scenarios where no video title set is available and a supervised approach is not feasible are both frequent and challenging. In this paper we go beyond previous studies and demonstrate the feasibility of clustering unknown video streams into subgroups even though no information is available about the title. We address this problem by exploring Natural Language Processing (NLP) formulations and Word2vec techniques to compose a novel statistical feature for clustering unknown video streams. Through experimental results over real datasets, we demonstrate that our methodology can cluster 72 of 100 video titles from a dataset of 10,000 video streams. We therefore argue that the proposed methodology could contribute to the emerging and demanding domain of encrypted Internet traffic classification.
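The general idea of treating traffic metadata as text and grouping streams without a title set can be illustrated with a small sketch. This is not the paper's method: the abstract does not specify the statistical feature, so the example assumes each stream is a list of burst sizes in bytes, turns each size into a coarse "word" by binning, and substitutes a plain bag-of-words cosine similarity for the Word2vec embeddings. The names `tokenize`, `cluster`, `bin_size` and `threshold` are hypothetical.

```python
from collections import Counter
import math


def tokenize(stream, bin_size=100_000):
    """Turn burst sizes (bytes) into 'words' by quantizing into bins.

    The bin size is an illustrative assumption, not a value from the paper.
    """
    return [f"b{size // bin_size}" for size in stream]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(streams, threshold=0.8):
    """Greedy single-pass clustering without any known title set.

    Each stream joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    Returns a list of clusters, each a list of stream indices.
    """
    clusters = []  # pairs of (representative vector, member indices)
    for index, stream in enumerate(streams):
        vector = Counter(tokenize(stream))
        for representative, members in clusters:
            if cosine(vector, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]
```

Calling `cluster` on a list of streams returns groups of stream indices; in practice the binning and threshold would need tuning on real traces, and learned embeddings such as Word2vec would replace the raw token counts.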
  • References (22)
  • Citations (1)
#1 Ran Dubin (BGU: Ben-Gurion University of the Negev), H-Index: 5
#2 Amit Dvir (Ariel University), H-Index: 9
Last. Ofer Hadar (Ariel University), H-Index: 17
Desktops can be exploited to violate privacy. There are two main types of attack scenarios: active and passive. We consider the passive scenario where the adversary does not interact actively with the device, but is able to eavesdrop on the network traffic of the device from the network side. In the near future, most Internet traffic will be encrypted and thus passive attacks are challenging. Previous research has shown that information can be extracted from encrypted multimedia streams. This in...
9 Citations
Oct 1, 2017 in LCN (Local Computer Networks)
#1 Jonas Höchst (University of Marburg), H-Index: 3
#2 Lars Baumgartner (University of Marburg), H-Index: 7
Last. Bernd Freisleben (University of Marburg), H-Index: 37
To cope with the varying delay and bandwidth requirements of today’s mobile applications, mobile wireless networks can profit from classifying and predicting mobile application traffic. State-of-the-art traffic classification approaches have various disadvantages: port-based classification methods can be circumvented by choosing non-standard ports, protocol fingerprinting can be confused by the use of encryption, and current supervised learning methods for analyzing the statistical proper...
8 Citations
#1 Roei Schuster (TAU: Tel Aviv University), H-Index: 4
#2 Vitaly Shmatikov (Cornell University), H-Index: 50
Last. Eran Tromer (TAU: Tel Aviv University), H-Index: 30
The MPEG-DASH streaming video standard contains an information leak: even if the stream is encrypted, the segmentation prescribed by the standard causes content-dependent packet bursts. We show that many video streams are uniquely characterized by their burst patterns, and classifiers based on convolutional neural networks can accurately identify these patterns given very coarse network measurements. We demonstrate that this attack can be performed even by a Web attacker who does not directly ob...
40 Citations
Mar 22, 2017 in CODASPY (Conference on Data and Application Security and Privacy)
#1 Andrew Reed (USMA: United States Military Academy), H-Index: 2
#2 Michael Kranch (USMA: United States Military Academy), H-Index: 1
After more than a year of research and development, Netflix recently upgraded their infrastructure to provide HTTPS encryption of video streams in order to protect the privacy of their viewers. Despite this upgrade, we demonstrate that it is possible to accurately identify Netflix videos from passive traffic capture in real-time with very limited hardware requirements. Specifically, we developed a system that can report the Netflix video being delivered by a TCP connection using only the informa...
24 Citations
Over the past few years, neural networks have re-emerged as powerful machine-learning models, yielding state-of-the-art results in fields such as image recognition and speech processing. More recently, neural network models started to be applied also to textual natural language signals, again with very promising results. This tutorial surveys neural network models from the perspective of natural language processing research, in an attempt to bring natural-language researchers up to speed with th...
307 Citations
#1 Ran Dubin, H-Index: 5
#2 Amit Dvir, H-Index: 9
Last. Ofir Trabelsi, H-Index: 2
The increasing popularity of HTTP adaptive video streaming services has dramatically increased bandwidth requirements on operator networks, which attempt to shape their traffic through Deep Packet Inspection (DPI). However, Google and certain content providers have started to encrypt their video services. As a result, operators often encounter difficulties in shaping their encrypted video traffic via DPI. This highlights the need for new traffic classification methods for encrypted HTTP adaptive...
8 Citations
May 17, 2015 in S&P (IEEE Symposium on Security and Privacy)
#1 Robert Lychev (MIT: Massachusetts Institute of Technology), H-Index: 6
#2 Samuel Jero (Purdue University), H-Index: 7
Last. Cristina Nita-Rotaru (Purdue University), H-Index: 32
QUIC is a secure transport protocol developed by Google and implemented in Chrome in 2013, currently representing one of the most promising solutions to decreasing latency while intending to provide security properties similar with TLS. In this work we shed some light on QUIC's strengths and weaknesses in terms of its provable security and performance guarantees in the presence of attackers. We first introduce a security model for analyzing performance-driven protocols like QUIC and prove that Q...
32 Citations
Dec 5, 2013 in NeurIPS (Neural Information Processing Systems)
#1 Tomas Mikolov (Google), H-Index: 41
#2 Ilya Sutskever (Google), H-Index: 57
Last. Jeffrey Dean (Google), H-Index: 52
The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarch...
6,418 Citations
Jan 16, 2013 in ICLR (International Conference on Learning Representations)
#1 Tomas Mikolov (Google), H-Index: 41
#2 Kai Chen (Google), H-Index: 24
Last. Jeffrey Dean (Google), H-Index: 52
2,934 Citations
Jan 1, 2013 in TMA (Traffic Monitoring and Analysis)
#1 Tobias Hoßfeld (University of Würzburg), H-Index: 33
#2 Raimund Schatz, H-Index: 28
Last. Louis Plissonneau, H-Index: 8
This chapter investigates HTTP video streaming over the Internet for the YouTube platform. YouTube is used as concrete example and case study for video delivery over the Internet, since it is not only the most popular online video platform, but also generates a large share of traffic on today's Internet. We will describe the YouTube infrastructure as well as the underlying mechanisms for optimizing content delivery. Such mechanisms include server selection via DNS as well as application-layer tr...
95 Citations
Cited By (1)
#1 Amit Dvir (Ariel University), H-Index: 9
#2 Angelos K. Marnerides (Lancaster University), H-Index: 9
Last. Chen Hajaj (Ariel University), H-Index: 3
Cyber threat intelligence officers and forensics investigators often require the behavioural profiling of groups based on their online video viewing activity. It has been demonstrated that encrypted video traffic can be classified under the assumption of using a known subset of video titles based on temporal video viewing trends of particular groups. Nonetheless, composing such a subset is extremely challenging in real situations. Therefore, this work exhibits a novel profiling scheme f...