admin管理员组

文章数量:1516870

mit概率系统分析

mit概率系统分析

by Moin Nadeem

由Moin Nadeem

我们如何跟踪和分析MIT超过200,000人的足迹 (How we tracked and analyzed over 200,000 people’s footsteps at MIT)

My freshman spring, I had the pleasure to take 6.S08 (Interconnected Embedded Systems), which teaches basic EECS concepts such as breadboarding, cryptography, and algorithmic design.

大一的春天,我很高兴参加6.S08 (互连嵌入式系统),该课程教授EECS的基本概念,例如面包板,密码学和算法设计。

While the class was incredibly time-consuming and challenging, I have to say it was one of the most rewarding classes I have taken thus far. I’m proud to have worked with some incredible people (Shout out to Avery Lamp, Daniel Gonzalez, and Ethan Weber for the endless memories), and together, we built a final project that we wouldn’t forget.

尽管这门课非常耗时且富挑战性,但我不得不说这是我迄今为止参加过的最有意义的课之一。 我为与一些不可思议的人一起工作而感到自豪(向无尽的回忆向Avery Lamp,Daniel Gonzalez和Ethan Weber致敬),并且我们一起建立了一个我们不会忘记的最终项目。

For our final project, our team knew we wanted to be adventurous. While walking to get ice cream one day, Avery suggested a device to monitor WiFi probe requests, similar to how some malls do. After some initial research and persuasion towards our instructors, we decided to commit and began researching the idea.

对于我们的最终项目,我们的团队知道我们想冒险。 有一天,步行去吃冰淇淋时,艾利建议使用一种设备来监视WiFi探测请求,类似于一些购物中心的做法。 在对我们的教师进行了初步研究和说服之后,我们决定做出承诺并开始研究这个想法。

什么是WiFi探测请求? (What are WiFi Probe Requests?)

Most people consider their phone as a receiver; it connects to cellular / WiFi networks, and for all practical uses, are only functional when connected. However, when phones are searching for WiFi networks, they commonly also send out small packets of information called probe requests.

大多数人认为手机是接收器。 它连接到蜂窝/ WiFi网络,并且对于所有实际用途,仅在连接后才可以使用。 但是,当电话搜索WiFi网络时,它们通常还会发送一些小的信息包,称为探测请求

These probe requests send snippets of information such as a unique MAC address (similar to a fingerprint), RSSI signal (logarithmic signal strength), and a list of previous SSIDs encountered. As each phone will send out one MAC address (excluding recent attempts at anonymization), we can easily leverage these to track students walking around campus.

这些探测请求发送信息片段,例如唯一的MAC地址(类似于指纹), RSSI信号(对数信号强度)以及遇到的先前SSID列表。 由于每部电话都会发送一个MAC地址(不包括最近的匿名尝试),因此我们可以轻松地利用它们来跟踪在校园内散步的学生。

收集探测请求 (Collecting Probe Requests)

Requirements for our final project included using the standard 6.S08 components we used during the semester: a Teensy microcontroller, an ESP8266, and a GPS module. However, given the low power consumption of the ESP8266 (120 mA), and the lack of a need for a strong CPU, we decided to bypass the Teensy altogether. This design decision required to us to learn how to utilize FTDI programmers in order to flash an implementation of Arduino for our ESPs, but it enabled us to continue with an environment which provided strong familiarity and a breadth of libraries over the built-in AT-command firmware.

最后一个项目的要求包括使用本学期使用的标准6.S08组件: Teensy微控制器, ESP8266和GPS模块。 但是,鉴于ESP8266的低功耗(120 mA)以及对强大CPU的需求,我们决定完全绕开Teensy。 这个设计决定要求我们学习如何利用FTDI程序员 ,以便为ESP刷新Arduino的实现 ,但它使我们能够继续使用一种环境,该环境提供了对内置AT-的熟悉和广泛的库命令固件。

Within the next few days, we had a Proof of Concept which tracked probe requests made around campus; this was enough to mitigate any doubts from our professors, and it was game on.

在接下来的几天内,我们进行了概念验证,以跟踪校园内的探测请求; 这足以减轻我们教授的任何疑问,而且还在继续。

开发POC (Developing a POC)

Now that we knew enough about probe requests to proceed, our team spent the next few days writing the infrastructure which would permit us to collect these requests en-masse. I wrote a Flask + MySQL backend to manage device infrastructure + information, Avery worked on an iOS application to ease deployment of devices, Daniel Gonzalez created a beautiful frontend to our website, and Ethan created an analytics platform which transformed the wealth of incoming data into intelligible data with valuable insights.

既然我们对继续进行的探查请求有足够的了解,我们的团队将在接下来的几天中花费大量时间编写基础结构,这将使我们能够批量收集这些请求。 我编写了Flask + MySQL后端来管理设备基础架构+信息,Avery在iOS应用程序上工作以简化设备部署,Daniel Gonzalez创建了我们网站的漂亮前端,Ethan创建了分析平台,将大量传入数据转化为具有宝贵见解的可理解数据。

On a hardware side, Daniel and Ethan soldered our ESP8266s onto prototype boards, along with some power modules. We reused the PowerBoost 1000Cs given to us by the class to make these devices entirely portable, which had the nice side effect of permitting us to perform tracking in some tight locations.

在硬件方面,Daniel和Ethan将我们的ESP8266和一些电源模块焊接到原型板上。 我们重复使用了班级提供给我们的PowerBoost 1000C ,使这些设备完全可移植,其副作用是允许我们在一些狭窄的位置执行跟踪。

Notably, the team dynamic was absolutely wonderful: we laughed together, learned together, and truly enjoyed each other’s company. Deploys at 4am weren’t so bad when it’s with some of your best friends.

值得注意的是,团队动态绝对是很棒的:我们一起笑,一起学习,并真正享受彼此的陪伴。 与一些最好的朋友在一起时,凌晨4点的部署还不错。

部署中 (Deploying)

Given that Ethan wrote some nifty code to auto-connect the devices to the nearest unsecured WiFi hotspot on boot, and Avery wrote an app to update the location + last moved fields (useful for knowing which MAC addresses to associate with each location), deployment was as easy as plugging in the devices to a nearby outlet and ensuring that it was able to ping home. Deployment was quite enjoyable if you got creative with it.

鉴于Ethan编写了一些漂亮的代码以在启动时将设备自动连接到最近的不安全的WiFi热点,而Avery编写了一个应用程序来更新位置+最后移动的字段(用于知道与每个位置关联的MAC地址)就像将设备插入附近的插座并确保能够回拨一样简单。 如果您有创造力,则部署会非常有趣。

分析数据 (Analyzing the Data)

After letting the project run for a week, we collected about 3.5 million probe requests (!). I’d also like to note that the data is all anonymized; in no way is this data fine-grained enough to determine a mapping from MAC addresses to individuals, mitigating most privacy concerns our instructors had.

在将项目运行一周之后,我们收集了大约350万个探测请求 (!)。 我还要指出,数据都是匿名的。 这些数据的细粒度绝对无法确定从MAC地址到个人的映射,从而减轻了教师的大多数隐私问题。

We began by applying Ethan’s work to all of the locations, which caused immediate excitement. Our data clearly showed the periodic behavior behind each location.

我们首先将Ethan的工作应用于所有地点,这引起了立即的兴奋。 我们的数据清楚地表明了每个位置背后的周期性行为

Furthermore, it was clearly indicative of some greater trends throughout campus: major arteries (Lobby 10, 26–100) achieved peak traffic around 5pm, whereas buildings on the edge of campus (such as Stata, which has a cafe) achieved peak traffic at noon. Needless to say, with the infrastructure in place, the data becomes much more exciting.

此外,这清楚地表明了整个校园的一些趋势:主要动脉(大厅10,26-100)在下午5点左右达到了高峰流量,而校园边缘的建筑物(例如拥有咖啡厅的Stata)在下午达到了高峰。中午。 不用说,有了适当的基础架构,数据将变得更加令人兴奋。

Once we figured out that the data for these trends existed, we began to ask ourselves some more interesting questions:

一旦确定存在这些趋势的数据,便开始问自己一些更有趣的问题:

  • What could we conclude about the make + distribution of devices at MIT?

    对于麻省理工学院的设备制造和销售,我们可以得出什么结论?
  • What if we modeled our campus as a network graph?

    如果我们将校园建模为网络图怎么办?
  • Is there a most common walk?

    有最普通的散步吗?
  • More interestingly, could we predict where people will go next given their location history?

    更有趣的是,根据他们的位置记录,我们可以预测人们下一步将去哪里吗?

We proceeded to attack these one-by-one.

我们进行了一对一的攻击。

分析数据集 (Analyzing the Dataset)

MAC addresses actually provide a plethora of information in 6 bytes; we can leverage this information to determine more about the people who walk around us. For example, every manufacturer purchases a vendor prefix for each device they manufacture, and we can use this to determine the most popular devices around MIT.

MAC地址实际上提供了6个字节的大量信息。 我们可以利用这些信息来确定有关我们周围人的更多信息。 例如,每个制造商为他们制造的每个设备购买一个供应商前缀,我们可以使用它来确定MIT周围最受欢迎的设备。

But there’s also a catch — recent attempts to use this technology to track individuals by the NSA has led many manufacturers into anonymizing probe requests. As a result, we won’t be able to fully determine the distribution of devices, but we can investigate how widespread probe request anonymization is.

但是也有一个问题-NSA最近使用这项技术跟踪个人的尝试导致许多制造商匿名化了探测请求。 结果,我们将无法完全确定设备的分布,但是我们可以调查探测请求匿名化的普及程度。

It’s quite ironic that any device which anonymizes probe requests actually informs you that they do it — in anonymized devices, the locally addressed bit (second-least significant bit) of the address is set to 1. Therefore, running a simple SQL query lets us know that nearly 25% of devices anonymize MAC addresses (891,131 / 3,570,048 probe requests collected).

具有讽刺意味的是,任何对探测请求进行匿名处理的设备实际上都会通知您它们已完成操作 -在匿名设备中, 地址的本地寻址位 (第二低有效位)设置为1。因此,运行简单SQL查询可以让我们知道将近25%的设备使MAC地址匿名 (收集了891,131 / 3,570,048个探测请求)。

Running the list of vendor prefixes (first three bytes of a MAC address), we see that the top two out of top eight addresses are anonymized.

运行供应商前缀列表(MAC地址的前三个字节),我们看到前八个地址中的前两个是匿名的。

  • Locally addressed “02:18:6a”, 162,589 occurrences

    本地寻址“ 02:18:6a”,出现162,589次
  • Locally addressed “ da:a1:19”, 145,707 occurrences

    本地寻址“ da:a1:19”,出现145,707次
  • 74:da:ea from Texas Instruments, 116,133 occurrences

    来自Texas Instruments的74:da:ea,出现116,133次
  • 68:c4:4d from Motorola Mobility, 66,829 occurrences

    来自Motorola Mobility的68:c4:4d 66,829次
  • fc:f1:36 from Samsung, 66,573 occurrences

    来自三星的fc:f1:36,出现66,573次
  • 64:bc:0c from LG, 63,200 occurrences

    LG的64:bc:0c,出现63,200次
  • ac:37:43 from HTC, 60,420 occurrences

    ac:37:43来自HTC,出现60,420次
  • ac:bc:32 from Apple, 55,643 occurrences

    苹果公司的ac:bc:32,出现55 043次

Interestingly enough, while Apple is by far the largest player in probe request anonymization, they seemingly randomly send out the real address every once in a while. For someone tracking at a frequency as high as ours (nearly every second), this is problematic; we checked with iPhone-owning friends and were able to track their location with a frightening accuracy.

有趣的是,尽管苹果公司是探测请求匿名化的最大参与者,但他们似乎偶尔偶尔会随机发送出真实地址。 对于以高达我们的频率(几乎每秒)跟踪的人来说,这是有问题的。 我们与拥有iPhone的朋友进行了核对,并以惊人的准确性跟踪了他们的位置。

预测未来的位置 (Predicting future locations)

After modeling student’s walks as a network graph, we realized that we could easily calculate the probability of going to another node given the node they were previously at. Furthermore, we realized that this graph may be easily modeled as a Markov Chain. Given an initial set of vertices, where would they go next?

在将学生的走动模型建模为网络图之后,我们意识到,只要给定他们先前所在的节点,就可以轻松计算出到达另一个节点的概率。 此外,我们意识到可以轻松地将此图建模为马尔可夫链。 给定初始的一组顶点,下一个将要去哪里?

However, this posed a significant challenge: our database had little understanding of when a walk began, and when a walk ended. It was little more than a dump of coordinates with locations and timestamps. If you examined the walks manually, it was clear when some began and others ended, as the times would be quite far apart from each other.

但是,这提出了一个严峻的挑战:我们的数据库几乎不了解步行何时开始以及步行何时结束。 它不过是带有位置和时间戳的坐标转储而已 。 如果您手动检查步行路线,那么很清楚什么时候开始,什么时候结束,因为时间之间相距甚远。

This can be understood by examining the image above. For example, this individual clearly didn’t walk from Stata to the Whitaker building, since those are on different days. However, our database has no idea of this, and as any subsequent attempts to use this data would produce flawed results.

通过检查上面的图像可以理解这一点。 例如,这个人显然没有从Stata步行到Whitaker大楼,因为那是在不同的日子。 但是,我们的数据库对此一无所知,并且随后进行的任何使用此数据的尝试都会产生错误的结果

Interestingly enough, if we re-structured this as a problem of clustering time-series data, it becomes very intriguing. What if there was a way to cluster the timestamps, such that we could identify the various “walks” a student went on? Considering the recent buzz around clustering time-series data, I figured this would be a fun project to start my summer with.

有趣的是,如果我们将其重新构建为聚类时间序列数据的问题 ,它将变得非常有趣。 如果有一种方法可以对时间戳进行聚类,以便我们可以识别出学生进行的各种“行走”怎么办? 考虑到最近有关聚类时间序列数据的热议,我认为这是一个有趣的项目,可以从今夏开始。

将数据库解析为遍历 (Parsing the database into walks)

In order to best understand how to potentially cluster the data, I needed to better understand the timestamps. I began by plotting the timestamps onto a histogram to better understand the distribution of the data. Gladly enough, this simple step helped me hit pay dirt: it turns out that the frequency of probe requests with respect to distance from an ESP8266 roughly follows a Gaussian distribution, permitting us to use a Gaussian Mixture Model. More simply, we could just leverage the fact that the timestamps would follow this distribution to wring out the individual clusters.

为了最好地了解如何潜在地对数据进行聚类,我需要更好地了解时间戳。 我首先将时间戳绘制到直方图上,以更好地了解数据的分布。 足够高兴的是,这个简单的步骤帮助我付出了很多努力:事实证明,相对于ESP8266的距离,探测请求的频率大致遵循高斯分布,从而使我们可以使用高斯混合模型 。 更简单地讲,我们可以利用以下事实:时间戳将遵循此分布来绞散各个群集。

The subsequent problem was that a GMM needs to be told how many clusters to use, it won’t identify it on its own. This presented a strong issue, especially when considering that the number of walks each individual went on was highly variable. Gladly enough, I was able to utilize the Bayes Information Criterion, which quantitatively scores models with regard to their complexity. If I optimized to minimize by BIC with respect to model size, I could determine an optimal number of clusters without overfitting; this is commonly referred to as the elbow technique.

后续的问题是, 需要告知GMM使用多少个集群 ,而不能自行识别。 这带来了一个严重的问题,尤其是当考虑到每个人进行的步行次数变化很大时。 足够高兴的是,我能够利用贝叶斯信息准则 ,该准则对模型的复杂性进行了定量评分。 如果我根据模型大小通过BIC进行了最小化优化,则可以确定最佳的聚类数量而不会过度拟合。 这通常被称为肘部技术。

The BIC worked reasonably well at the beginning, but would overly penalize individuals whom went on many walks by under-calculating the number of possible clusters. I compared this with Silhouette Scoring, which scores a cluster by comparing the intra-cluster distance with the distance to the nearest cluster. Surprisingly enough, this provided a much more reasonable approach to clustering time-series data, and avoided many of the pitfalls that BIC encountered.

BIC在一开始就运作得相当好,但是会由于低估了可能出现的集群数而对走了很多步的人过度惩罚。 我将其与Silhouette Scoring进行了比较,后者通过比较群集内距离与到最近群集的距离对群集进行评分。 令人惊讶的是,这为聚类时间序列数据提供了更为合理的方法,并且避免了BIC遇到的许多陷阱。

I scaled up my DO droplet to let this run for a few days, and developed a quick Facebook bot to notify me upon completion . With this out of the way, I could get back to work predicting next steps.

我按比例放大了DO滴,使其运行了几天,并开发了一个快速的Facebook机器人,以在完成时通知我。 有了这种方式,我可以重新开始工作并预测下一步。

开发马尔可夫链 (Developing a Markov Chain)

Now that we’ve segmented an enormous string of probe requests into separate walks, we may develop a Markov Chain. This permitted us to predict the next state of events given the previous ones.

现在,我们已经将大量的探测请求细分为单独的遍历,我们可以开发一个马尔可夫链。 这使我们能够根据给定先前事件来预测事件的下一个状态。

I used the Python library Markovify to generate a Markov model given a corpus from our previous step, which shortened development time comparably.

我使用Python库Markovify生成了一个马尔可夫模型,该模型给出了上一步给出的语料库,从而大大缩短了开发时间。

I’ve included a sample Markov Chain generated above; this actually represents the walk from a student leaving lecture (26–100 is a lecture hall) and going to his dormitory! It’s really exciting that a computer was able to pick up on this and generate a similar walk. Some locations are repeated, this is because each recorded location actually represents an observation. Therefore, if a location appears more than others, that simply means that the individual spent more time there.

我已经包含了上面生成的样本马尔可夫链; 这实际上代表的是学生离开讲堂(26-100是教室)走到他的宿舍的步行路程! 计算机能够接上并产生类似的行走,真是令人兴奋。 重复某些位置,这是因为每个记录的位置实际上代表一个观测值。 因此,如果某个位置比其他位置显示更多,则仅表示该人在该位置花费了更多时间。

While this is primitive, the possibilities are quite exciting. what if we could leverage this technology to create smarter cities, work against congestion, and provide better insights into how we may be able to reduce mean walk times? The data science possibilities in this project are endless, and I’m incredibly excited to pursue them.

尽管这是原始的, 但可能性是非常令人兴奋的 。 如果我们可以利用这项技术来创建更智能的城市,应对拥堵并更好地了解我们如何减少平均步行时间,该怎么办? 该项目中数据科学的可能性是无止境的 ,而我为实现它们而感到异常兴奋。

结论 (Conclusion)

This project was one of the most exciting highlights of my freshman year, and I’m incredibly glad we did it! I’d like to thank my incredible 6.S08 peers, our mentor Joe Steinmeyer, and all the 6.S08 staff who made this possible.

这个项目是我新生一年最激动人心的亮点之一,我非常高兴我们做到了! 我要感谢我令人难以置信的6.S08同行,我们的导师Joe Steinmeyer和所有使这一切成为可能的6.S08员工。

I’ve learned a lot while pursing this project, from how to cluster time-series data, to the infrastructure required for tracking probe requests around campus. I’ve attached some more goods below of our team’s adventures.

在执行该项目的过程中,我学到了很多东西,从如何对时间序列数据进行聚类,到跟踪校园内探测请求所需的基础设施。 我在我们团队的探险活动中附加了一些其他物品。

翻译自: /

mit概率系统分析

本文标签: mit概率系统分析