
So, welcome to track 3. Now we're going to have a talk by Ian Smart, who's an intern at NCC at the moment and is also an Abertay University ethical hacker. Ian's going to be speaking about some cool stuff that he's found in apps, stuff that the apps probably shouldn't be saying. Yeah, pretty much. Over to Ian.

Cool. Hi everyone. As Duncan said, I'm Ian, and I'm talking about apps. I'm just going into my final year of Ethical Hacking and Countermeasures at Abertay University. I'm the secretary of the Abertay Ethical Hacking Society, Abertay Hackers on Twitter; give them a follow, there are some tweets, though sometimes we forget. I'm also an organiser of Securi-Tay, which is our annual conference. It's quite good, and I'm going to shamelessly plug it because it's my conference, so I may as well get people to buy tickets. I'm on Twitter as well.

So, the research. Basically, we wanted to know what apps were sending that they weren't advertising they were sending. Everyone knows that apps are saying more than they claim: analytics for whatever companies, to "aid user experience" or whatever reasons people give. This article is actually a really old one, from January last year: "Angry Birds and 'leaky' phone apps targeted by NSA and GCHQ". So, big surprise, they're trying to get information from apps. In it they discussed people uploading photos to social networks, and finding that from the information transmitted, people were able to get basically everything about you. The social media networks that they looked at (I actually can't remember which ones) were stripping all the metadata from images before they went public on the network, but it was still transmitted with the images in the first place. So if you could intercept the data, you could get it before it was stripped by the social network. Points to them for trying, but it didn't really work. So we decided to have a look at the top 10 apps in each category and see what they're sending that they really probably shouldn't be. This is going to be a fairly quick talk, because actually the
research hasn't progressed quite as far as I'd hoped it would have done by this stage; it's still ongoing at the moment. We plan to cover Android and iOS; so far we've only managed iOS. But this is what we did. Basically, we looked up the top 10 apps and made a huge spreadsheet with them all in, because the top 10 apps are actually extremely fluid. We checked what was in the top 10, looked again the next day, and in one category four of those top 10 we couldn't even find in the store anymore. I don't know if they were malicious apps that had been found out to be so, or whatever, but yeah, we had a big list of them. We set them up to install, got coffee, drank coffee, realised that our wifi was really, really slow, got more coffee, tested the apps, uninstalled them, and started the whole thing again. This was quite tedious, as you can probably imagine.

For the actual testing, we proxied an iPhone through Burp, so we just set Burp to run as an HTTP proxy, which is worth noting, because for some apps we didn't find anything: they must have been communicating over other protocols. There was no HTTP traffic, but there was obviously some communication going on. Part of the future research is trying to find out what they're doing. Then: load the app, click buttons, and hope stuff happens. The recon was all completely passive. So any accounts, any apps that wanted a sign-in... basically, signing in was the only thing we didn't do. We used search functions; we'd go to login forms and pretend we'd forgotten our passwords, but not submit anything, just to see what was actually being sent before you actively give the app any information. Obviously, once you've given them your information they can use it as they wish. Then close the app, reset the phone, make sure there's nothing running in the background, clear Burp, and start again. We ended up with 1.4GB worth of Burp logs: 228 captures, roughly that many apps (there were some duplicates), from the top 10 for
each of the categories. We found some really interesting stuff, not even from an analytics point of view. This next app was one designed to teach children to read, and this is how they protected their in-app payments: instead of using a nice password, they put up three numbers and ask you to type them in. Because that's totally great in an app that teaches kids to read; there's no way they'll be able to read the numbers and then type them in. I don't know how effective that is for them. I'm sure it probably works.

And then we realised we had a problem, because we had all these Burp logs, and they were quite tricky to parse. I had a hundred or however many, and I thought it was great: you can just extract the files, because Burp save files are just zip files, so you can unzip them and then read the file that comes out. So I tried that, wrote a wee script, pulled through them, and it came back with nothing. So I rewrote it, and it came back with one or two requests from each of them, which was nowhere near the level of data we were expecting.
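That first, naive approach can be sketched roughly like this (an illustrative reconstruction, not the actual research script): treat the capture as a zip archive and grep each member for HTTP request lines.

```python
import io
import re
import zipfile

# Matches an HTTP request line anywhere in a blob of bytes.
REQUEST_LINE = re.compile(rb"(?:GET|POST|PUT|DELETE|HEAD) \S+ HTTP/1\.[01]")

def extract_requests(capture):
    """Naively treat a capture as a zip archive and grep every member
    for HTTP request lines. This misses most traffic, because a Burp
    save file is not simply a flat archive of raw request text."""
    found = []
    with zipfile.ZipFile(capture) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            found.extend(m.group(0).decode("ascii", "replace")
                         for m in REQUEST_LINE.finditer(data))
    return found
```

Because the saved state isn't stored as plain request text, a script like this only finds the handful of requests that happen to survive as readable strings, which matches the one-or-two-per-file result.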
From the 200-odd captures we ended up pulling out about 17 requests doing it that way, and that's definitely very wrong. So the solution was to go through all of these captures manually. If anyone knows a better way of doing this, please let me know after the talk, because this took ages. For each capture: open it in Burp, select all the items in the proxy history, save them as an XML file, move on to the next one, realise Burp was using all my RAM, reboot the machine, do it again. This took about a week. Great fun.

So, sorted: all the requests in nice XML files that we could quite easily go through. Nice and easy to parse, because it's XML, so Python libraries; we used Beautiful Soup for that. Should be great. Kind of, except that because there were null bytes in a lot of the requests and responses, Beautiful Soup was basically crashing partway through. So the bodies were base64-encoded, which isn't that bad, because base64 is easy to deal with. But what we then found was that some URLs and responses were base64ed again, so we had to go through everything all over again and see if it was base64ed a second time: try decoding it, realise it wasn't, and work out what was actually being said. So, as I said, I wrote a Python script that went through all the
log files first of all, and just went through and checked all the requests to see what they were. The first stage was to see how many hosts each app connected to: see who it's talking to, see if there's any crossover between them. That's just metadata, so it doesn't tell you anything that interesting on its own. We rolled that up into a MySQL database and saved it for future processing.

Some of the numbers we got from that: 8,855 requests, which is quite a lot given that we weren't running the apps for that long and we weren't even signing in. Some of the apps we found were sending maybe one or two requests, and it was either certificate pinning, so you could get absolutely nothing from the app because Burp's interception would obviously fail, or it was saying no, you can't get any data until you've signed in with your Apple account, Facebook account, Twitter account, or whatever they require for sign-in. Those were no use, which makes some of these numbers slightly inaccurate, because the calculations haven't taken into consideration the number of apps we couldn't communicate with.

The total number of hosts connected to was about 620; 618 exactly. That seemed like quite a lot for 200 apps, we thought. And then the total host-app pairings: every time one app connected to one host, that counts as a pairing, and if it connected to the same host again it wasn't counted again. What we found was that a lot of apps were talking to multiple hosts, and a lot of hosts were connected to by a lot of apps. I also realised that I don't know how to do databases: I made a typo, and it turns out the average of eight is eight. Who knew? A bit more in-depth detail here: on average, 1.7 apps talk to each host (I did wonder if I'd written these slides the wrong way round, but no, that's right), and each app connected to on average 7 separate hosts when you ran it, which seemed quite a lot to me. The highest: one app was connecting to 34 different services. I
thought it was maybe subdomains of the same thing, so a host with a static subdomain and a different domain for their CSS and one for images and so on. But it wasn't. This one was connecting to, I think, four different CDNs, a bunch of subdomains of their own stuff, and also six or seven different analytics companies, which is excessive if you ask me. I'm not an app developer; I don't know how they're using that data; maybe they need it.

The categories: these are the top ten-ish categories by count of hosts connected to. This is taking the top 10 as a whole, not each individual app, so again these numbers might be a bit skewed. I think finance should actually be higher up that chart based on what we saw, because we only tested four apps in the finance section, and between the four of them they were connecting to 54 hosts, which is a lot given that they were banking apps or transaction apps or whatever other finance apps there are. They seem quite noisy for what they do. Lifestyle included apps like shopping apps and a couple of magazines; I guess I can kind of see why they're getting content from all over the place, but still, the jump from weather to lifestyle is quite ridiculous there.

And more details. The most connected-to host that we found was itunes.apple.com. Big surprise: we're testing on an iOS device. A
lot of them were in-app purchases: those go through the App Store, and the apps would throw an error trying to connect to the App Store to buy the product, because the certificate check fails through the proxy. Or links to other apps: if the developer had multiple apps and put adverts in for them, a "click here to download my other app", you click that and, again, certificate failure. So probably nothing that useful in there.

The next two were analytics providers: Scorecard Research and Flurry, which is an analytics service for mobile app developers. I don't know if there are any developers in here that have used it. No? Cool. It was horrific to deal with, because all of the requests are really weirdly encoded. You know when you cat binary data in Linux and then Ctrl-C halfway through, and your terminal ends up in Japanese or something? That happened basically every time I ran my script over them, which made things a bit more difficult to deal with.

These are the next top four: Flurry, which I've just mentioned, advertising, Google Analytics, and another tracking one. All of these were connected to by multiple apps. Another part of my future research is to see what overlap there is. So if we look at the 19 connections to each of those two providers, it'd be interesting to see how many apps are speaking to both of them and how many apps
are choosing to just use one, to see if there's any implicit benefit of using one over the other; if there's something one isn't providing that's turning people towards the other, or if they just like collecting all the data they possibly can. Redundancy? I'm not really sure.

So, now the requests that we saw. This is an average request, just a plain HTTP GET. It looks a bit messy, but the important bits: we have what looks like a unique identifier, a UUID of some variety. I wasn't able to work out what it did. We also have "you're using an iPhone 6". Maybe not that useful as information goes, but you can see what devices people are using, which helps you develop for screen resolutions or whatever. The OS version in the user agent; you see that an awful lot. That's kind of useful, I guess: you know how often people are upgrading, and how long you need to support old OSes. Good analytics.

This one was in a sports app, for leaderboards, what football team is winning, who's winning what, and so on. That's a lot more information at once. What we find there is: you've got the app name first of all, the iOS version, then the action you made in the app. Every single time you click a button, that gets sent off again; user-flow tracking, I guess. Then "iPhone 6", the process name, the version of iOS and of the application itself, and then radio=wifi. We found this one could tell you whether you're on Wi-Fi or a phone network. I wasn't able to test this myself, but I believe that if you're on a phone network, you can actually see radio= and then the phone network name. I can kind of see the use of a developer wanting to know if you're on Wi-Fi or a phone network. But I don't really see why the developers of applications need to know what network you're on, what mast you're connected to, and whatever else they try to get. And also whether you're using the phone in portrait or landscape, at the end there. Which, I don't know, might be useful to them.
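A request like that is easy to pull apart once you have the URL. Here's a minimal sketch in Python; the beacon host and parameter names are invented for illustration, reconstructed from the kinds of fields just described rather than taken from any real provider.

```python
from urllib.parse import parse_qs, urlsplit

# A hypothetical analytics beacon: a unique ID, device model, OS version,
# radio type, and orientation, the sorts of fields seen in the captures.
beacon = ("http://analytics.example.com/collect"
          "?uid=6fa459ea-ee8a-3ca4-894e-db77e160355e"
          "&device=iPhone6&os=8.1&radio=wifi&orient=portrait")

def summarise(url):
    """Flatten the query string of a tracking request into a dict,
    taking the first value for each parameter."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
```

Flattening the query string like this makes it easy to count which fields (radio, orientation, OS version) turn up across different apps.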
Again, it seemed a bit excessive. This is a weather app; this was another one that was quite noisy. The interesting bits here: you've got the server it's connecting to, the name of the application, URIs, and then what looks like a connection key of some variety. We saw an awful lot of these API keys: developers build an API, and all of these apps authenticate by sticking a static key in. A couple of tests showed it was just a static key across installs and across devices. Interesting way to go if you want to be secure, just use static keys; it'll probably work, no one's ever going to abuse that. Again, radio=wifi. And last_run and app_runs, which I found quite interesting: last_run changed based on the time and date you last used the app, and app_runs is how often you've used it. I'm assuming that's where those games get the "hey, you've not played for a while, your cubes miss you, why not come back and play" messages: a way to guilt-trip you into playing more. And then a unique visitor ID, which was not persistent across installs in a lot of cases, but sometimes seemed to be, so I don't know exactly how it was being calculated.
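Some of the values in these requests were base64 inside base64, and some (as comes up later in the talk) had zip files underneath. Here's a rough sketch of peeling those layers back automatically; the function and its depth limit are my own illustration, not the research scripts.

```python
import base64
import binascii

ZIP_MAGIC = b"PK\x03\x04"  # zip local-file-header signature

def peel_base64(blob, max_depth=5):
    """Repeatedly base64-decode a value until it stops decoding cleanly.
    Returns (payload, layers_removed, looks_like_zip). A payload that
    happens to itself be valid base64 will get over-peeled, so this is
    a heuristic, not a proof."""
    layers = 0
    while layers < max_depth:
        try:
            decoded = base64.b64decode(blob.strip(), validate=True)
        except (binascii.Error, ValueError):
            break  # not valid base64 any more: stop peeling
        if not decoded:
            break
        blob = decoded
        layers += 1
    return blob, layers, blob.startswith(ZIP_MAGIC)
```

Checking for the zip magic bytes at the end is what catches the "base64ed twice, zip underneath" case.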
And then down the bottom, not actually highlighted, we have whether the iOS screen reader is active, so they can find out whether your phone is talking to you.

Ian, just a follow-up on that. Did you at any point track some of the other details on the phone, like the name of the device and so on, and look for that in the data?

We did look for that. Annoyingly, Apple are actually really good at this. I think in one of these, yeah: this is a request here trying to get your MAC address, and Apple going, "Nope, not getting it. Not allowed. Can't have it." Part of the research that we wanted to do, as I said, was Android apps, and I've got the feeling that either quite an old version of iOS or Android apps would be giving away a lot more information, just because of the permission models. I don't particularly like or use iPhones very often, but I know Apple have got quite good lately at refusing access to things like device ID. And actually, another one that's not highlighted here, "odid"; I'm assuming that's device ID, I don't actually know, but again, another "unavailable" in there. So it looks like some of these apps are actually trying to get your device ID and send that off. This was to an ad network, and actually there's
more analytics data in the ad network requests than there seems to be in the analytics ones. One that we found: location. This was an app that used a secure API to communicate with the likes of Foursquare and the other "what's in my area" services. They advertised as being secure when they did this, and as far as I can tell they may well be secure between their server and Foursquare's servers, but whenever you say "hey, I'm in the middle of Manchester, find me a hotel" or "find me some food", it sends that in a plain HTTP request. So that's your exact location, to a reasonable number of decimal places. I guess you can kind of let them away with that; they actually want to use your location.

This one was an ad request, another big advertising company that was getting a lot of requests. A lot of stuff there is encoded somehow; it looks like it's just URL-encoded, but I didn't have time to decode it. It says you're in Manchester, which makes sense for targeted advertising: hey, go to this new restaurant in Manchester. Another ad company, the same one actually, was sent the location to, what's that, 15 decimal places? So they know where you are to within a millimetre, something like that? I don't know the exact numbers; I know that one decimal place can be out by a few miles, and two decimal places by, I don't know, half a mile or something. They really don't need that level of accuracy to say, "hey, do you want to go to a shop?"

This was another interesting one: any app that had embedded maps connected to Apple's map server, good old Apple Maps. I don't know if this is actually useful; I thought I'd put it in just out of interest. This seems to depend on the tiles being viewed, and there are some keys in there being sent with tile requests. I don't know whether it could actually be useful to someone, because when you load up the app, if it was, say, a food chain and you want to know
where the nearest restaurant was, it'll say "hey, the nearest one is here" and then immediately start pulling down map tiles based on your location. I'm not really sure, as I say, whether it's something that needs looking into: whether you could use it to work out where someone is or the areas they're looking at going to, or whether it's some arbitrary thing that only Apple can decode. I wouldn't be surprised either way, having seen what we've seen here.

Oh yeah, this was the other one: Google Analytics. As I say, we didn't sign into any applications, so I don't know if this changes based on your actual username, but it sent "test user", which is what I was signed in as at the time, because that's what the app signed in as by default. If it is sending your username the whole time, that's probably not great, especially given how these apps handled certificates. A few were using certificate pinning, great, but some of it was failing: the app going "oh no, the certificate doesn't pin... oh well, it's probably fine, we'll trust this guy." An awful lot of the applications, 90% of them probably, were just sending everything in plaintext, and of the rest we saw a few that went "right, I'll use HTTPS because it's secure and we'll do it properly" and then just kind of ignored the certificates. Big red "oh no", it's the button that you shouldn't click
and they just go "yeah". So, I really have flown through this; it was a quick talk. Some thoughts that we came up with. Apps are talking an awful lot: as I said, about 9,000 requests from the top couple of hundred apps, and that's not even all of them, because we couldn't intercept the requests from the ones that, worse luck, were doing things properly. There's a lot of behind-the-scenes stuff going on, an awful lot we couldn't see. I was speaking to Adam about this, actually, and he made the point that maybe the analytics companies are securing things precisely so that people like us can't see what's going on. It's not that they're stealing all your data and just want to be careful about it; it's so that people who poke about to find out what an app is doing can't actually find out what it's saying about you, so that we don't turn around and go, "oh, look at them, they're stealing our data, they're evil people." Some requests, we can see, are obfuscated in ways I didn't have time to work out. Some are encoded strangely; we found a few that were sending what looked like zip files, base64ed and then base64ed again. So another thing on the to-do list is to go
through them and see what's actually in the zips. I wouldn't be surprised at all if there's a lot of information in there. We found one really bizarre one where the developers had obviously been told, "hey, don't use CSV for your API anymore, it's really old and no one uses CSV anymore, you should use JSON." So all they did was wrap their CSV in JSON; presumably, on the other end, they pull it back out of the JSON and parse the CSV. So yeah, they were using JSON. Great. Good job.

Some apps do clever things that get around detection of analytics. A lot of the analytics data was in GET requests; some of it was encoded, and some apps were sending in batches, so instead of making contact every time you did something, they would wait until you'd done 10 clicks and then send the whole lot in a nice bundle that could be processed and split up server-side. We found one that, instead of sending anything obvious, went "look, we're sending no parameters, it's great, we're not sending any information", and put literally everything in the user agent. The user agent was about three or four thousand characters long, and it sent off more than just the iOS version: your location, your ID, everything you'd just done, all that sort of thing. So it must have been using a dynamic user agent, which is a bit weird and probably doesn't help the server deal
with it very much.

Future research. More digging is definitely required. This was a project that got delayed by the difficulties with parsing the files, and unfortunately it only really took off about last Friday in terms of getting information out. There's an awful lot of data I've got that I can't actually make use of yet, but I will be looking into it in the future. A more effective way of pulling out that information would be great; my script was thrown together, and I'm the king of writing Python scripts that kind of just work until you give them something unusual, then they break, then I cry. A nice, automated way of pulling out the information would be really good to have. Checking Android apps: as I said, the Android permission model up until Marshmallow, the next build, was kind of broken, in that if you didn't want to give an app its permissions, you just didn't install it. And I don't know exactly what information Android apps are allowed to access, like MAC address, Wi-Fi hostname, phone network, etc.; it would be quite interesting to find that out.

Speaking of that, some apps, I reckon, just make something up when they can't get your network, because I was using an iPhone that didn't have a SIM card in it, and yet I had a request that said I was on O2, a request that said I was on Vodafone, and I think one of the other phone networks as well, which couldn't be true because there was no SIM card. But I
think some of them just kind of go, "yeah, they're probably on a network, we'll just guess one." Not entirely sure what was going on there. Again, if anyone's come across it, please let me know, because I'm curious. And that's it; I really have flown through that. As I say, if anyone's got any questions...

I've got a point and then I've got a question. In your GET request, where it said MAC address unavailable, I believe it also said the iOS version was 8.1? In iOS 6 they removed access to some of those specific APIs, and they went even further after that, specifically to stop ad networks tracking people. So that's probably why that is; I expect if you go back to older versions of iOS it will work, and they will be able to do those metrics. And the question: when you were talking about the applications that were failing open on the SSL checking, did you report that to the vendors?

We've not done that yet. As I said, with this research we've basically only managed to look at what's actually happening very recently; it's an ongoing project, and because this has been done as NCC research, I'll need to go to them and sort out reporting these issues. And we also need to verify things first: these tests were done briefly, once, on one phone, so we need to check again, and I didn't have time to go through it all. It took about an hour
and a half to install 10 apps each time; that's why there was an awful lot of coffee. Yeah?

You were just talking about the SSL certificate checking, the TLS. Do you have numbers on that?

Hopefully. I don't have the numbers on me, but I do have access on my laptop to the numbers of apps that were successfully checking certificates. We refined our methodology as we went, I think, and it could really do with being done again right from the start. You get halfway through research and go, "I wish I'd been recording that the whole time." So if I were to do it again, that would definitely be one of the things it would be interesting to look into.

Just a thought on the question about when it invents a network for you. If the device has recorded the previous ICCID number from a SIM card that's been in it, it might associate that with a particular network.

Right. That could be what it was doing. As I said, I don't know; I just know it sometimes appears to be randomly guessing. But then that wouldn't explain why different apps were sending different networks. No. So yeah, that's what's really odd. If it consistently said O2, great, the phone's been used on O2. But it seems to just guess. Could the phone be connected to a network for emergency purposes? Even without a SIM card, that still works. Right, it could have been that. Or it may have been just pulling up a list of all the networks and picking the first one or something. Or if the base model of phone was linked to a particular network, maybe; that could also make sense.

I know this talk was kind of quick, but part of it was also to encourage people, to show how easy it is to get some form of information out of apps. It's really, really easy to tunnel your phone through Burp. Two months ago I didn't know it was possible; I'd never seen it done before, and I'd decided the person who did it was a wizard. So yeah, it's kind
of just an interesting one, to see how easily accessible that information is. No, as I said, we were doing everything on-device, and Burp was doing nothing more than intercepting and recording traffic; it wasn't modifying packets or anything like that.

On the subject of APIs, people really need to think of a good way of picking version numbers. Most of them are as bad as Valve and can't count to three. A few of them are using three, one is using four, and one is using API version 7.3-point-something. So when you see version six or seven in an API, you kind of just stop and wonder, because it wasn't a company that had been around all that long.

Just a question around the obfuscation that we're seeing with a number of the requests. This type of behaviour, typically very layered, basic obfuscation techniques, is something that we see quite a lot in malicious software as well. A lot of malware authors use it around distribution and delivery of the payload and so on. Given the interest that a lot of malware authors have in the mobile space at the moment, because of its popularity, what do you think was the scariest technique you saw the internet advertising and analytics guys using that malware authors might be able to put to use?

I think the double base64ing and then zipping was quite an interesting one, because it was sending legitimate data that it had legitimately acquired from the device, and I didn't realise it was a zip at all until someone else who had seen it before pointed it out. So it's one of those things that, once you've seen it, people are going to be aware of it. Putting things in the user agent was quite an interesting one too, because at the start I was only looking at parameters: I had Burp set to filter to requests that had parameters, so obviously I would have missed that user-agent one at first. It was by complete coincidence that I Ctrl-C'd at a random point in the program and
went, "huh, that user agent doesn't look right."

That's why it was sticking out to me as well, because they're both techniques that are being used in malware.

I didn't really come across anything else that I particularly thought "oh, that would be great in malware", but I've done no work in malware analysis at all. Good question. Yes? ... That's an angle I hadn't even considered. The first thing I would suggest is ad blockers of some variety, but I believe they require jailbroken or rooted phones; I don't think there are any you can get legitimately, and you don't really want to be encouraging absolutely everyone to go around jailbreaking their phones, because that would end well. Other than that, I actually can't think of anything off the top of my head. It would be another thing of interest to research. It's something I'm more interested in for Android apps, where, like you say, it's pretty much "yes, I trust this app to do what it says it's doing" or "I'm not going to use this app." I believe that's changing in the next version, so it's something to keep track of. It is supposed to be. But all part of this is: can you actually trust the app to do what it says it's going to do? This research was also done as part of a group who are looking into general API use. So
they may have some other ideas as well, but it's not something that I've looked at.

I think one thing to call out in that space is the value of this data to these analytics firms and monitoring companies. If the route they use to get it is stopped, they're not just going to walk away from all the money they've been earning; they're going to find another way of getting access to the data they need.

Something else following on from that: it would be interesting, if you could get a budget, to look at paid apps in particular, whether there's a certain category... you know, if you spend more money on an app, are you more or less targeted than with a free app, where you kind of know you're paying for the app with your information, so it's kind of a given? There might be some value in that.

So, as I said, the methodology for this was definitely refined as we went. It started with "here's a phone, here's a laptop, go do some research", and it did get better as we went. It would be great to be more thorough and more scientific in the way we test, and also, as you say, to get better samples from each area and actually record, as we mentioned, which apps are actually pinning and which apps are failing open, and
whether there are any changes between free and paid apps. It would definitely be interesting; it's definitely on the further-research list.

One more question. What made you pick Apple? You say you don't really like Apple yourself, and I'm sure it's probably got more holes in it than the alternatives. Why?

We thought this research would go quicker than it did. I did iPhone first and planned on moving on to Android, and I realised that if I wanted to actually have anything to say at this talk, rather than "hey, here's 800 Burp files, you guys can go through them yourselves", I had to get one platform done. But the scripts I've written will work for the Android apps too, because it's all just Burp files; it's now just a case of downloading them, running them, and doing the same thing again. So if any of you want to download 10 apps, put them through Burp, and send me the captures, great. Cool. Thanks for listening, guys.