Juicy: Notorious B.I.G. Data

This week, Joel is out and Jon is in! He catches up with a fellow Appvia colleague, Chris Nesbitt-Smith, to talk about all things big data in cloud. It's as confusing as it is big... so buckle up.

To keep the conversation going, join our Slack community and follow along on Twitter


Welcome to Cloud Unplugged, the podcast where we take a light-hearted and honest look at the who, what, when, where, why, how and OMGs of cloud computing. In today’s episode, Jon is joined by Chris Nesbitt-Smith, a friend and co-worker of ours, to discuss big data and cloud, a subject that is as confusing as it is large. So with that, let’s get directly to the interview.

So I’m here with Chris. Welcome to the podcast. A bit of a change, it’s not Joel. I was going to fake an American accent, but I thought that was maybe going a bit too far, maybe slightly insulting to the US listeners, so I avoided that temptation. Chris, do you want to introduce yourself? What you do, a bit of your background?

Sure. So I’m Chris Nesbitt-Smith. I’ve spent time professionally working in all sorts of different industries: card fraud, central government, banking, media, IoT, print, all sorts. Generally a developer at heart at the crux of it, but I generally end up focusing on helping organisations hit business-value-driven transformation goals. So avoiding the typical technical navel-gazing that you can often find yourself trapped in, and looking at how you can genuinely move the dial, based on business value.

Chris was looking directly at me when he said navel-gazing. Slightly worrying, you really locked eyes on me when you said that, so I’m not sure what to think! But yeah, that’s really good. Chris and I have worked together historically on a lot of this stuff, first very directly and closely, and then not so closely over time, but very much in the DevOps space. I think when I met you, you were an engineer on a project at the time, doing Node and some other stuff.

Yeah, the technical lead on a thing. And I ended up kind of drifting, I guess, into the DevOps-type space, trying to actually make things testable and repeatable, which DevOps almost is. It’s a move away from sysadmin-ery, where it’s kind of log into a box, do some things, and no idea what happened later. It feels a bit more repeatable nowadays.

Yeah, so we’re gonna speak about big data and cloud. 

Which is a niche, small topic.

Just a really tiny topic to pick at. But I know you were working on some big data stuff at the time, and we can come on to that. From your perspective, when you hear these terms, big data, people doing big data, what do you think it means to most people? What would be your definition of it, for yourself and for others?

I mean, it’s a super woolly nonsense term. And what you would have called big data a few years ago probably isn’t now. There’s very much a difference with what I’d call cumbersome data, which should probably be used more as a term: it’s a bit awkward, a bit difficult to move around, and it’s in the way, but it doesn’t fundamentally fall into the category where you need to deal with a big data problem. Typically with on-prem stuff you might have seen people rolling out Hadoop and such on a local cluster and ploughing away with that. But nowadays, with compute being what it is, it’s a bit more commodity to do that with simpler tech generally, and to think about how you replicate and distribute rather than having a single thing that does everything. So I guess big data, when it comes to cloud, is when you need to start thinking about how you separate those concerns out. In cloud terms that normally means the high end of terabytes, petabytes, exabyte-type scale. If you’ve got gigabytes’ worth of data, that doesn’t really put you into that category. But everyone generally feels like they’ve got a genuine big data problem, and it’s the same sort of buzzword as AI and machine learning and things like that. Which, like, sure, but is it really? Or is it just some hype buzzword that doesn’t really apply?

Yeah, and I guess a lot of those things are around data; when it comes to modelling and learning, obviously it’s data-based. Not database, but data-based. When you’ve worked on things, because I know you’ve worked on certain projects, there are usually two things, so I’ll pick on them independently and get your opinion on each. One is, as a company, when it gets to the data, people really start to care. You can get fined massive amounts, there are data leaks, and nobody wants sensitive assets leaked either. So data is usually a sensitive topic; that’s when security starts to get involved quite heavily. And the second bit is the cost: obviously, the more data you have, the more expensive it’s probably going to be in cloud. I know you’ve worked on these things. So one question is how you’ve seen this go down on the cost side, how to do it cost-effectively. And the other is how you manage security when you’re going to cloud with a lot of data, if some of that data is sensitive, and then when it’s in the bucket of being big data, whatever that means. How companies struggle, I guess, approaching data and then getting to cloud and all this other stuff. And I know you’ve had first-hand experience of going on this journey in certain companies.

Yeah, so on the cost side, data is often termed, in a buzzwordy world, as being the new oil or whatever: a commodity thing that has value in itself. And yeah, absolutely, particularly in Europe with GDPR, there’s a real thing where you get fined by a governing body when you get it wrong. And you can end up in worlds where, if there’s demonstrable negligence, the fines can step over any of the limited liability of an organisation and land on directors. Which is good if you’re a consumer; if you’re actually one of those people on the board who’s put their name down as being responsible for it, it’s a bit terrifying, because ultimately you end up putting people’s houses on the line just because of where they work. And yeah, absolutely, those people should want to understand how things work, and what the risk is that they’re inherently signing up to.


Oh, and of course PCI as well, where you can get slapped by your card issuer if you leak a load of credit cards; you’re not gonna be very popular even if that’s all you leak and nothing else. So yeah, there are genuine and legitimate concerns to consider. The other big thing about security and public cloud is that it changes the frame. When you were running things on prem, you could be a bit more reactionary, because inherently you’re probably not on a multi-gigabit network connection to the outside world. So your however-many-gigabytes of data that make up your entire data set would take some time to egress. It would leave your estate, leave the building, one way or another, whether on a USB stick or because someone’s emailed a great big payload somewhere. That takes time, and it’s a bit more obvious to spot, just by observing the amount of network traffic that looks unusual. In cloud, depending on how it’s actually implemented, it can all be over within fractions of a second: going from your initial breach to losing all of the data, or all confidence in the integrity of the data, can happen within fractions of a second. Which is terrifying.

People listening in: their eyes slowly getting wider as you speak, worrying about their data as they listen.

Yeah, I mean, they’re legitimate concerns to think about. But the flip side to putting stuff in public cloud is that, depending on the technologies you use, you’re putting a lot of onus on the cloud vendor, under the shared responsibility model, to be responsible for a lot of those things you would have done yourself. Say you were running a SQL Server locally: you’re responsible for patching both the OS and the database server and everything else, all the way up the stack. With any of the recent vulnerabilities we’ve seen in, say, Windows, if it’s a Windows SQL Server, you’ve got to be on top of that constantly. And that is expensive and risky. There have been plenty of vulnerabilities over the last weeks and months and years that would rather concern me if you’ve not really nailed and invested in locking that configuration down. And it would probably still keep you up at night occasionally.

But that is kind of the challenge, isn’t it, a bit? Traditionally you’d have other engineering specialisms getting involved. And we’ve all known the ring-fencing: let’s put a security boundary around it, because all of our stack is on prem, it’s fine. We’ve got a security boundary, we’ve got gateways that are intercepting packets, and you know…

Yeah, it’s the eggshell mentality. As soon as you get through that initial perimeter, it’s all weak, it’s all gooey and lovely, and lateral movement happens easily. It only needs one chink in that proverbial armour and you’re in, and then you’ve got everything and can laterally move around, which is terrifying. Whereas if you can delegate that, make some of it someone else’s problem, and dictate what you want by the policy you apply… I mean, if you’re moving to cloud, you’re probably not going to suddenly lose all of the specialism and expertise you had in house; it’s still going to exist for you as you migrate. And if you can leverage it well, you can get those people to articulate what their concerns are and apply them in a different way to how you’d have mitigated those concerns before. So things like data leak protection: you can set up alerts and put a process in place that mitigates their concerns, but possibly in a different way to how they’ve done it before.

Yeah, I think that’s really important, because people’s inertia about cloud comes from that false sense of security you get when you’re on prem and things aren’t public. There’s that feeling of, I can go and point at the box, there’s my data. There’s a psychology behind it feeling more safe. And then you’re moving it somewhere else, and the parameters are totally different, and it is public, it has public concepts, it’s literally called public cloud. You can understand why people start to freak out: my data is going to a public cloud, and I don’t want my data public. I guess that’s the concern. But you’re right in the sense that you can still do security, if not better, in the cloud. It’s just a different approach; it’s not one or the other.

Yeah, obviously. I mean, if you do it right, then it’s significantly better, because the cloud vendors have far more resources than any organisation does to implement the processes and segregation of duty. In any reasonably sized organisation, people talk about segregation of duty between different people, but ultimately, if it’s a Microsoft shop or whatever, someone’s got domain control somewhere, or physical access to a thing, and you can lose the whole thing. It’s all about thinking about the cost of what it would take someone to target you, and what their actual expense is. I mean, ultimately it comes down to pointing a gun at someone’s head and getting them to type in their credentials, and there is a price you can put on doing that. I’m sure you can find plenty of dark web ways of doing that and… Jesus Christ.


Please don’t do that. Don’t.

But consider that there is a price tag on it, and then proportionately mitigate against that. As long as the cost of doing it is more than the reward, you’ve made it enough of a deterrent that it’s not attractive to do. But yeah, human beings will ultimately always be your weakest link, whether they’re sharing credentials in Slack to make things easy, or in an email, or whatever. Or legitimately doing their job and their end device gets hacked because of whatever they’re doing on it. I mean, there have been plenty of things cited around professional gamblers having machines or laptops taken out of hotel rooms and all sorts. It’s a real thing, and it does really happen. So it’s thinking about that, but that’s the same world as on prem, wherever it is: thinking about those controls. But in cloud you can put policy in place and effectively lock yourself out of the data. So instead of having full admin creds, you can actually manage that a bit better and do much more in the way of alarm bells if anything actually is wrong and looks like it’s going awry.

Yeah, that’s useful. So I want to talk about some of the big data stuff. We’ll get onto storytime (sorry Joel, I’m stealing storytime) in a little while. But the most common situation is people thinking that putting all of your data in one place makes great sense. You know, it’s all there, it’s all in one place: let’s have one big, giant data platform, let’s stick all of our services over the top of it, and then people can just engineer against it. So it would be good to talk about that, because I know you’ve got first-hand experience of actually being on a project like that, and we can come on to the storytime in a minute without naming who it actually is. It’d be good to get your perspective on why you think people do that, whether you believe it’s the right thing to do, and then what the alternative is, if you believe it isn’t.

Yeah, I guess we’ll start with the why. It’s often conceptually easy to think about: okay, if I put all my data in there, that means I can then extract the value of it. It’s like, I’ll store all my oil, or gold, or whatever, in one place, and therefore I can somehow leverage it better. The fundamental problem with a lot of that is that your data often isn’t samey, and people have different concerns over what it is. So it doesn’t fundamentally end up working.

So you mean the data sets, how they’ve been inputted to begin with, will be fundamentally different. Even if the data underneath was the same, what they’ve called it could be different, or even just the data types could be fundamentally different. You mean that?

Yeah, and levels of quality, right? And format, and things like that. So, say, some people might model a telephone as being a location, and some might model a telephone as being a physical item, an object. And that concept makes sense in the right domain, but it’s not necessarily useful when you’re then trying to cross-reference, so it’s not really a thing you can do. And it could be simple stuff, like how I write the telephone number: whether you’ve got the full country code, the +44 or whatever for a UK number, at the start. It’s not necessarily useful to try to normalise all your data down to what will inevitably be the lowest common denominator, where everything is super-normalised. That then makes it really expensive, in compute and otherwise, to actually query and use, because you’ve taken everything to its most generic form and modelled it in some data model that’s universal for everything. You’ve made it useful to no one apart from your modelling exercise, and for the applications that are trying to both read and write to it, that’s not really particularly useful; you put a massive burden on them actually consuming it.
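[Editor’s note] The telephone-number point above can be made concrete. This is a toy sketch, not a production library; the function name and the normalisation rules are illustrative assumptions, showing why two systems must agree on a canonical form before their records can be cross-referenced:

```python
import re

def normalise_uk_number(raw: str) -> str:
    """Reduce common UK phone formats to a canonical +44... form (toy rules)."""
    digits = re.sub(r"[^\d+]", "", raw)   # strip spaces, dashes, brackets
    if digits.startswith("+44"):
        return digits
    if digits.startswith("0044"):         # international dialling prefix
        return "+44" + digits[4:]
    if digits.startswith("0"):            # national format, e.g. 020...
        return "+44" + digits[1:]
    return digits                         # unknown format: leave untouched

# Two systems storing the "same" number only agree after normalising:
print(normalise_uk_number("020 7946 0123"))      # +442079460123
print(normalise_uk_number("0044-20-7946-0123"))  # +442079460123
```

Even this tiny example hides decisions (what to do with unknown formats, extensions, non-UK numbers) that each application domain would answer differently, which is exactly the cost a universal data model pushes onto every consumer.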

Yeah. So I guess what you’re saying is, you might have a certain business unit, it could all be one organisation, but the application services you’re designing might appeal to users in one vernacular. So you might name something a certain way that makes sense to those users, and there’ll be other users for whom it makes a different kind of sense, not quite the same, but it actually brings value being different, because that’s how they see the world and that’s their day-to-day job. Like you’re saying: the telephone is a location to one team, but to another team it’s an object, and depending on how they do their jobs, that makes perfect sense to them. So either you change it, pick one or the other and lose value in the end, or you have to map and bridge: when we call it this over here, we actually mean that over there. And then you’re into all these weird…

Yeah. And you can do that, obviously, that’s all possible, but for why? What’s the actual business value you’re going to achieve from it? If it’s just for a purist notion of modelling all your data and having it all in one place, then it actually betrays the concept you were originally trying to apply: we’ll put it all in one place, therefore we can index it, and we can secure it all because our security policy becomes simpler and easier. Even that’s not necessarily easy, because you’ll have cases where some folks really care about integrity. Audit, for example: with audit you care mostly about integrity. So that probably looks like some sort of transaction store of streaming events coming in, which you want to be able to replay to see exactly how something happened. Whereas with your credit card data, you probably care more about confidentiality; if you lost some or it became wrong, that would be less of a problem, because you could probably just ask the customer for it again. So there are different guises of things: what you can back up from, and what the data looks like as it comes in. If it’s coming in as a firehose of events that you’re storing because you may want to replay it in an audit one day, and you need to for compliance or peace of mind, that’s an entirely different thing.
And your online web sales might be a customer data store where you don’t necessarily care, when you’re looking at the customer record, about all of the changes that have happened to that account. You just need a snapshot of what it is today. If they’ve gone and changed their email address, well, that’s fine, but you don’t need to know all of their last ones, and why and when they changed it, and what browser they were using when they changed it, and all sorts.
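[Editor’s note] The contrast being drawn here, an append-only event log for audit versus a snapshot store for operational data, can be sketched in a few lines. The store names and event shape are made up for illustration:

```python
# Audit side: append-only event log (integrity, full replayable history).
# Customer side: snapshot store (only current state; old values overwritten).

audit_log: list[dict] = []       # entries are appended, never rewritten
customers: dict[str, dict] = {}  # latest state only

def change_email(customer_id: str, new_email: str) -> None:
    # Record the fact of the change for audit, then update the snapshot.
    audit_log.append({"event": "email_changed", "id": customer_id, "to": new_email})
    customers.setdefault(customer_id, {})["email"] = new_email

change_email("c1", "first@example.com")
change_email("c1", "second@example.com")

print(customers["c1"]["email"])  # second@example.com: snapshot keeps one value
print(len(audit_log))            # 2: the log still knows the full history
```

The two shapes want different storage, backup, and access-control treatment, which is why smooshing them into one platform tends to satisfy neither.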

Yeah, I see that. I think this is a thing, isn’t it, when it gets to the data side. The intentions are usually good, but people oversimplify, and then they generalise: oh well, I’m just going to treat data as all one thing. Whereas, like you say, audit might be one thing, so what does audit mean to you? Do you need to search it? Is it something you proactively do, or just something because you’re audited as a company? Maybe it’s PCI compliance, maybe it’s something else, but no one’s desperately checking the audit across the business out of pure excitement and joy, wondering what somebody did right this minute. It’s not necessarily an actual business need, so how you treat that should be different to how you treat the other data. It’s quite important to fragment those things apart so you can actually rationalise the ask. I do definitely want to get into the storytime, because I think it will help people contextualise. So it would be useful to talk about a platform you’ve worked on, without naming the actual name; let’s call it ‘the platform’.

A data platform that was common. Common meaning… right, exactly, common to an organisation. I mean, that was part of the question, right: whether it’s common data, or whether it’s a platform that is common to an organisation. And that in itself is part of the challenge, establishing what ‘common’ is. In this case, there was one bit of the org that had a relatively small amount of data, measured in gigabytes, but the integrity and confidentiality of it was especially important to them. Whereas another side was quite rapidly changing, with gigabytes a day coming in, quite complex data, but low data quality, where the other one was high data quality. And in a breach scenario, the fast-moving side would want their data to stay available, because it was more useful to them for that data to still be usable; they needed it for operational needs, and that was a risk decision they’d made. Whereas the other side, with the smaller data set, would want to pull the plug at the first smell of anything being awry.

But what was the ambition of it? What problem were they solving? I mean, I do know a little bit of the history, but I’ll let you explain. It’d be interesting to catch what was actually happening inside that business, which I’m pretty sure is probably quite common anyway, I’m just laughing, that drove them to think: we’ve really got to fix this problem, and the way we’re going to fix it is by building this common platform. Platform Next Gen, that just sounds cooler, doesn’t it? What else would you call it?

Yeah. The status quo at the start was that there were dozens and dozens of kind of well-known ‘data platforms’, in inverted commas, with organisational data in them, mostly managed by large external organisations that provided that as a thing, with some licence costs attached. Let’s say it was 50, an arbitrary number. And their aspiration was to rationalise that down, not necessarily for good, qualified reasons beyond some complexity and such, but the aspiration was to make that number lower, because there was an implied belief that a lower number would result in lower overall cost and simplification of the estate. So, with the desire of making something that solved all cases, they took the two ends of the organisation: something high volume, quite important for its use cases, and fast moving, and something lower volume but very security risk-averse, and tried to smoosh them into the same thing, with some view or aspiration that at some point you’d be able to query between them, do some interesting analysis querying both datasets, and that would allow you to enrich the model

a bit. So there were almost two things. One was reduction. The other was ideas around what they might feasibly want to do from a business perspective: match up some data to figure something out. Oh, if we can just query these things and know about this stuff, that’s going to bring some value to the business unit, and we can set up a project and people can start writing code to do that. So they believed you have to have it all in one place to achieve that, plus also reduce the estate overall.

Yeah. I mean, historically the org does things in a very disparate way. So these systems were, at the time, running on vastly different on-prem data centres managed by different vendors, so you would never have been able to draw a line between the two things; it fundamentally wouldn’t have been possible. But there also wasn’t really a business case of: here’s a question we want to ask of this data. That didn’t really exist. It was an aspirational thing of: if we build it, they will come. And it wasn’t too far-fetched to start thinking about what those questions might be; given the datasets, it was reasonably obvious what you could probably do with it, and no doubt those use cases would come. But a lot of it was proving it’s possible. In reality, the needs and asks were too far apart to make something compatible with both ends of those spectrums.

So what was the tech? What was the starting point? This is where it kind of gets interesting. I guess you already mentioned the Hadoop…

Yeah. So it was originally a Hadoop-based thing, with an aspiration of putting a graph database on top of that, and with Elasticsearch running for indexing and all sorts. And it was built with a very on-prem-type mentality, not leveraging anything that the cloud vendor had to offer.

So it wasn’t on prem and cloud; it was on-prem thinking in the cloud.

Absolutely. Which is inherently super expensive. Especially the non-production infrastructure, which was way expensive for what it was.

What was making it expensive? The amount of data, or just the charging model? What was it, the IOPS?

Yeah. I mean, it had no actual real data in it, because it was mostly just development, but the VMs… it was AWS, so I’ve called them out, but that’s irrelevant really. The way you were able to achieve performance to the disk was to effectively oversize the storage you needed, in order to get the IOPS to the disk. So consequently it was massively over-provisioned, even for the non-production, development-time clusters.
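[Editor’s note] The over-provisioning trap described here follows from how AWS gp2 volumes work: baseline IOPS scale with size at 3 IOPS per GiB, with a 100-IOPS floor and a 16,000-IOPS cap, so hitting an IOPS target can mean paying for far more capacity than the data needs. A rough sketch (the per-GiB price is an assumed ballpark, and varies by region):

```python
# gp2-style sizing: baseline IOPS = 3 per GiB, floor 100, cap 16,000.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
ASSUMED_PRICE_PER_GIB_MONTH = 0.10  # USD, illustrative only

def gib_needed_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume (GiB) whose baseline IOPS meets the target."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("beyond gp2's cap; a provisioned-IOPS volume is needed")
    if target_iops <= GP2_MIN_IOPS:
        return 1  # every volume gets at least the 100-IOPS floor
    return -(-target_iops // GP2_IOPS_PER_GIB)  # ceiling division

# A dev cluster holding ~200 GiB of data but tuned for 9,000 IOPS:
size_gib = gib_needed_for_iops(9_000)
print(size_gib)  # 3000 GiB provisioned to hold 200 GiB of data
print(round(size_gib * ASSUMED_PRICE_PER_GIB_MONTH))  # ~300 USD/month/volume
```

Multiply that across every node of a non-production Hadoop cluster and the cost of on-prem thinking in the cloud becomes visible quickly.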

Was that a Hadoop-specific set of requirements? Did they give you set requirements, like: these are the VMs you’d need for this level of performance, or…

They might have done, but I’m not sure how much it was that; I wouldn’t throw Hadoop or anyone under the bus. I think it was mostly the implementation, in order to meet the performance desire of what we were trying to do. I don’t know where those numbers specifically came from. But the end consequence was that you ended up not leveraging anything of why you would bother going to cloud in the first place. And also, because of how it was built, it was all in one availability zone, in a single region. So no cloud value at all, really.

All the things cloud gives you as well, none of that was realised, but…

Yeah, absolutely. And with trying to run our own on-prem HSM and share keys to it, not using any of the cloud vendor’s key management at all, doing that all ourselves. It was all a bit mad. So yeah, we went through some bits of a journey to look at how we could throw most of that away, if not all of it. What ended up being the conclusion of that project was a pattern, ultimately, rather than a single data platform, with a loose capability of: this is how you would conceptually query one or more of those datasets.

So to recap a little bit, because I’m going to do some educated guessing here: somebody with the Hadoop skills was on the project, I’m guessing. So it was technology led, to a degree: technology first, problem second. Rather than asking what are we actually trying to do with this data and what technology lends itself best to it, it was more like: well, we’ve got the tech for data, Hadoop in this case, because we’ve got the skills. Is it unfair of me to say that?

There was certainly a degree of that. I mean, it was a lot of: if my tool’s a hammer, then everything looks like a nail. That was the natural thing to reach for when you think it’s big data. In reality it was cumbersome data, hundreds of gigabytes in total probably, which is really overkill for Hadoop. And yeah, using on-prem technology to address that, with a mindset that there was a desire to be cloud-vendor agnostic and able to pick up and move to a different vendor, which is daft and loses any other business value. I mean, the cost of running it, in production especially, would have been mental. You could have spun the thing up independently on all three major cloud vendors, as independent systems, leveraging their kit and building it three different ways, and it would still have been a lot cheaper than just running it the way it was. And the responsibility you had to take on was like running a load of virtual machines.

This is really common, though. I think it’s humanistic to a degree. People have certain experiences with certain things, and some people don’t necessarily ask the right questions to start with: sorry, what exactly are we trying to do with this data? What is the exact use case, and on which bit of the data specifically, to then work out the right tech? Instead it gets approached holistically: data? Plus tech? Hadoop. Oh, multi-cloud? Okay, definitely even more Hadoop. And then you layer on all this security: HSM, don’t trust cloud. So you end up with all these decision-making processes where people come at it with traditional answers to what’s probably quite a bit of complexity that needs to be thought through first, before deciding anything.

Yeah, and within that team there was a weird, emotional distrust of the cloud vendor, which was weird to have. Some belief that if we were running the VMs, the cloud vendor would have less insight into what we were doing and what the data was, so we’d be protecting it from them. But in reality, moving key material over the public internet from an HSM to our VMs, not even into the cloud vendor’s key management, was, like, bonkers. As you say, it undid all the value of why you’d go to a cloud vendor, and undid all the benefits of the shared responsibility model and their ability to offer genuine segregation of duty. When we walked them in to come talk to us, and they talked us through all of the intricacies of how KMS works, how their data handling works, and all sorts of other mechanics behind NDA, how the thing actually hangs together, we were like: well, we’re obviously not going to be able to reproduce all that. That would be mad. Why would you? That’s their business.

Let it go, yeah, exactly.

So yeah, you kind of have to accept that cloud vendor lock-in is a real thing — but, sure, move along. The value release you can get from the vendor means it's simply not worth trying to avoid it and run it all yourself.

And lock-in, to me — this is a whole other subject, to be fair, I'm going to go on a tangent — when people think about lock-in, most of the time it isn't massively vendor-orientated. Sometimes it is, but a lot of the time it's the network: all the things that people struggle to do, that have a lot of risk associated with them. Networking, right — firewalls, trying to bridge it with Direct Connect. Let's make the cloud an extension of our on-prem network, let's do all this other stuff, and then let's couple all of our apps to it. So you start coupling the apps to those networks in that cloud, and when you then talk about lock-in, well — yeah, you really are locked in. How long did that take you? Six months? Well, now you've got another six months with another cloud, repeating the exact same thing, because you've locked your apps to it, you've bound the two together. That's far more locked in than using a vendor's library — a library you can switch out, test, know it works. But that network stuff is so fundamental that once the organisation's apps are designed against it, the change is so huge that you can never move.

There are some cases where it's real, though. If you're a public sector organisation, you need to be able to demonstrate it was a fair fight for whoever you went with, and you have to regularly go through an appraisal process: can I move things around? Am I actually locked in? And the other vendors can challenge that — as we've seen recently in the States with the big, whatever it was, however many billions of dollars that was —

The JEDI contract?

Yes, that's the one. But yeah, that fell through because it was being challenged between the two vendors that wanted it, and they kicked up enough of a fuss — it's a real thing, how you can end up locked in, and the other vendors can challenge it, as they're entitled to in the public sector. Sure. So yeah, if you go for, I don't know, something like Neptune or DynamoDB — something specific to that vendor — then you need to be prepared to think conceptually about how you might pivot out of it, like how you'd conceptually swap to Bigtable or whatever. But I wouldn't invest a huge amount of money, time and effort in actually trying to abstract yourself from it. Just reap the rewards of the decision you've made and use that technology properly — that's much better.

I agree, but I think it's more that people don't understand that when they think of lock-in — that's a prime example of how people attach to things, that lack of vendor neutrality. They're like: oh, this is very bespoke to this vendor, I'm about to use their library and my app is going to depend on it. But no one asks: well, how quickly is an app likely to change, really, in the grand scheme of things? It's the other stuff that people don't think of as lock-in — because the network is supposedly agnostic, right — all this other stuff people don't think about. I want to make up a word, "lockinable", which is totally made up. But that's the bit that really stops things moving. And it isn't a library, and it isn't a vendor-specific thing — the biggest barrier to moving anywhere else is usually all that ancillary stuff. There's always a bit that's the biggest problem, and it's never really the thing people perceive as lock-in. It's a misperception, to a degree.

Yeah, it's a bit of a red herring. The only thing is, depending on what sort of scale you're at — if you have an actual big data problem, and say you've got audit as a thing, that data set is going to get big. If you've got to keep it for a few years for compliance reasons, it's going to get really big. If you're doing things properly and there's a reasonable volume of traffic going through it, think about the cost of just moving the bytes from one place to another — moving it from Azure to GCP, say, as a concept. Well, I've now got to move this across the public internet, which means it's a massive payload, one that's really valuable as an asset in its own right, and I've got to think about how I'm going to move it. Yeah, exactly. And then you go: well, why? There's no terminal business value in doing that. The other side of lock-in is that, simply by virtue of having invested enough, there's no demonstrable reward in moving it somewhere else — which is often what people find when they price-compare between cloud vendors: it's much of a muchness most of the time, and it varies and changes anyway.
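
To make the "cost of just moving the bytes" concrete, here's a back-of-the-envelope sketch. The per-GB rate is an assumption for illustration only — real egress pricing varies by vendor, region and volume tier:

```python
def egress_cost_usd(data_tb: float, rate_per_gb: float = 0.09) -> float:
    """Rough cloud egress cost for moving data out over the public internet.

    rate_per_gb is an assumed flat illustrative rate; real pricing is
    tiered and vendor-specific, so treat this as an order-of-magnitude guide.
    """
    return data_tb * 1024 * rate_per_gb

# Moving a 500 TB audit archive at an assumed $0.09/GB:
print(round(egress_cost_usd(500)))  # 46080
```

Even before engineering effort, re-ingest costs and risk, the transfer alone runs to tens of thousands of dollars — which is the point Chris is making about there being no demonstrable reward.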

So what's pricier today could be different tomorrow.

Yeah, and it will stay reasonably competitive between the big players — at least it's not a complete monopoly; there's some competition. Ultimately they've all got to keep their pricing roughly the same as the next guy's.

So I guess that's totally off on a tangent — sorry. I get a bit of a bee in my bonnet about certain things. But anyway, back to the data. Big data — that's what we're here to talk about. I remember when you were doing platform X: you had all those technologies, and you said you were essentially going to get rid of it because it was too expensive — there was no real actual data in it anyway, so it was just costing a load of money without any return on value. So you started to look at all the technologies and change some of them to match the actual need. Do you want to talk more about that? What was the change, and how were you thinking about it? You mentioned audit being one thing — I don't know if that applied in this case or not.

Audit was there, but it wasn't the prevailing problem, because that's a streaming thing. So, yeah — to describe the rough tech stack: there was an on-prem HSM, with a bit of compute attached to it, that would share keys over the internet to a bunch of VMs. And there was a Jenkins in the middle of it all — because there normally is — your massive vulnerability point that had keys to everything, and kind of undid the benefits of anything else you might try to implement. There was Hadoop, Kafka, Kerberos and all sorts of other bits — the whole kitchen sink of an on-prem Hadoop-type world thrown at it, and Spark, and —

Right, okay — so you decided to get rid of it all. Because that's pretty much the common industry big data tech stack, isn't it? I'm pretty sure if I were to Google "big data tech stack", that's what would come up.

Yeah — again, all the normal things you'd point at, Elasticsearch and so on. And there was a bit of a magpie effect among the technical folks involved: they were trying to make it into a graph database on top, so they were running JanusGraph — which is experimental at best, really — on top of it all. That was fraught with problems and computationally massively expensive, and ultimately the conclusion was that there was no real reward. In fact it was dangerous: in order to meet the performance needs — which weren't high — there were race conditions where you would lose data. And given the sheer volume of data — which, as I say, wasn't an awful lot, thousands of things coming in a second, it's not massive — you could end up in race conditions where you'd lose things, which, given what the thing was supposed to do, is somewhat troubling. But yeah, there was certainly a magpie effect of looking for a shiny thing, something interesting, rather than something tried and tested that's someone else's responsibility to look after. So with this massive tech stack, I initially started challenging the security benefits and merits of what they'd done and why — challenging some of the precepts, like the feeling that doing it this way would let them potentially move cloud vendor if they wanted to, despite the organisation not having any commercial arrangement with any other cloud vendor, or any actual plan for how they'd do it. They felt conceptually that it would unlock them to do that — future-proofing in their heads, being "agnostic". But was that actually what was asked of them? No.

But then there was a sea change on the data side. So did you put a case together to suggest they go on the cloud-native journey?

It started with stripping out — trying to make it a little bit less bad. So, start stripping things out: all these weird and wonderful bits of "security" they'd put in place that controlled how someone would get onto a VM to SSH in and do things, all of which was managed by a Jenkins that we managed. Classy, right? Yeah, exactly — lots of Groovy in there, with this Jenkins in the middle of it all able to control everything. Break that Jenkins, with its local accounts, and you've got everything. And it's the same Jenkins doing builds of stuff from GitHub — the same one that's got access to everything. So trying to explain that that central thing was the one to fear — the big flashing red light. Yeah,

yeah, it's literally got all the secrets.

That thing there is the keys to the kingdom. You've disproportionately put all your effort into protecting these peripheral things — sure, they're a bit closer to the data — but you've neglected this one thing that has full admin access to everything, and that's its day job. You wouldn't even necessarily notice it misbehaving, because it's doing that all the time: it's always logging in. And it didn't help that the Jenkins was also the monitoring tool — the thing that checked everything was up. It was always logging into these boxes, so you couldn't even have alarm bells trigger when something logged into a box that shouldn't, because it's constantly connecting, asking "are you alive?", getting a yes or no, and then potentially sending out an email saying no. Misusing the tool — they had a hammer, and everything looked like a nail they could hit with it. So yeah, initially it was challenging what that thing did and how it connected out, and starting to break that down. It took time to unpick.

But again, it comes back to the requirements in the end: what were the actual requirements here? What problems were you trying to solve?

So initially it was: try and make that a little bit less terrifying. And at the same time, trying to unpick why other decisions had been made. Some things, like using Hadoop, were a longer argument than the security issues — with the security-led stuff, you could just point at something and it was easy to argue why you'd spend time making it a little bit less bad. And I presumed there was definitely a good reason they were doing things the way they were, because they weren't dumb people generally — so presume there's some logic in it, even if some of the implementation was a bit awry. But it turned out a lot of it was misinterpreted, and not aligned to the business value of why they were going down that route. So it got stripped out. Ultimately the end pattern was stripping pretty much everything out, and it ended up looking like the cloud-native pattern would have all along.

So the Jenkins got the chop in the end, I guess? Or is it still there?

I mean — if I were to tell you whether the Jenkins was still there, I'd be exposing the fact that there is definitely no Jenkins. There isn't. Honest.

So, if you were to do that again — or, I guess, taking this problem away from any one company, for any business thinking about doing big data — what would be the principles you'd look for? We've already picked on "don't try to sanitise it all down to some common-denominator thing", and "don't try to translate it all through some big translator API sitting on top". Is that right?

Yeah, absolutely. Just recognise that there are different things for different needs. Some systems will have an event-streaming thing, but you probably don't care about running queries over that — that could be your audit record, or if you're in IoT, maybe telemetry coming in from a bunch of sensors, feeding into something — a constant stream —

— of data, yeah —

— that you care about storing. You'll occasionally care about rolling it up and drawing some conclusions over time, or drawing some nice pretty time-series graphs or whatever. But your concerns about that are very different to, say, your customer database, or your credit card database — those you have different concerns about. The main thing to establish is: what's a system of record, versus a system of truth, versus a system of analysis, and so on — what are the common use cases for each? And when someone says "let's put all our data in one place", the best answer I can lean towards is: sure, one place — but make the "place" the cloud vendor, arbitrarily, whoever you've already got a relationship with. It's much of a muchness nowadays, really, unless there's some particularly shiny feature you want to exploit from a given vendor. Make the vendor the one place, because then, if your applications end up querying more than one data set, or you end up with data applications producing an abstracted interface that queries both, you can do all of that within one domain — without trying to handle networking between cloud vendors, which is hard. You can do it over the public internet, but most people don't have much appetite for that. And there's no good story yet for the equivalent of a Direct Connect between AWS and GCP — it doesn't exist, even within the same region.
You can do it, but you have to put yourself in the middle of the routing, in a telco facility somewhere that has the kit for the connections, and route it yourself — and now you're in the middle of a mess you really didn't want to be in. One day, hopefully, that won't be such an issue, and direct connections between the vendors may be possible.

Yeah. So if you're going to have data, there are conditions you've got to know. We've spoken before — you mentioned things like: not everything needs to be indexed, not everything needs to be searchable. Deciding which bits are which, the tolerance levels, the CAP theorem, things like that. These are the conditionals, aren't they — the specific questions you have to start asking of the data, so you know what solutions to put in place, as opposed to picking solutions without knowing what the data need is. So what questions would you be asking? Say I come to you: "Hey Chris, I've got loads of data. We've got this big data project, we want everything central, we're going to have some services that can query all this data, it's going to be absolutely awesome, it's exactly what we need." What would you be saying?

Well, first: what are you actually doing, Jon? The first thing is establishing what that data looks like, and some of the overarching concerns, to at least set the frame. Is there personal data in it? Are there credit card details, with PCI-type concerns? Or is it simply telemetry — I don't know, maybe you've got a tonne of weather stations around and you just want to collect the temperature across a city? That's also perfectly valid, and in that case it's a different type of concern. It all depends on the shape of the thing. And then start thinking about — well, whatever — but —

But I want it all in one place, Chris. I just want it all in one place, you know?

Well, what is "all"? That's a crucial thing to establish. And then it ultimately comes back to: why? What are you trying to achieve? You're leaping towards the solution and the answer without actually thinking about what the business question is. If the business question is "I want my data analysts and my sales folk to be able to write a succinct, simple query that pulls an answer from two data sets", then sure — but there are different ways to technically achieve that end goal. And you'll have all sorts of concerns around the implied confidentiality of things. If you put all the credit card details in one place, you can potentially expose bits that might be interesting to certain people — even if you mask out parts of the card details — depending on who the audience is and where you do the masking.
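
As a hypothetical illustration of the masking Chris mentions — a minimal sketch, not anything from the platform discussed — card numbers are typically truncated so only a few digits survive into downstream systems:

```python
def mask_pan(pan: str) -> str:
    """Mask a card number (PAN), keeping only the last four digits.

    Illustrative sketch: PCI DSS limits how much of a PAN may be
    displayed; here everything but the last four digits is starred out.
    """
    digits = pan.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_pan("4111 1111 1111 1111"))  # ************1111
```

Even masked like this, as the conversation goes on to show, the remaining digits can still be a correlation point if they sit next to other data sets.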

Maybe you start anonymising things — but someone can probably work out how you've anonymised them, and then look for trends across the other data. Find a pattern in it all and go: I think this is this person, because they do all these things — they were here, then they did that — and you start correlating. Even if it's anonymised, you could probably work it out. Is that what you're saying?

Yeah. I'll butcher the story, but there's a tale about a pizza shop near some American military facility. The pizza shop would always know when there was an operation going on, and roughly who the generals were — not which ones by name, but the mix of them — based on the pizza order that came in, because people always ordered the same pizzas from this external shop. That became a tell that there was going to be an operation somewhere. And based on the pizzas, you don't even need to know the names of the actual generals — you just know "this order looks like that". It's easy to de-anonymise data sets if you've got more than one data point. In that case they had a pizza order, and they knew stuff was happening and where the pizza was going — obviously that helps — and it's easy to correlate those two events. Exactly which generals, where the line is drawn, is largely irrelevant: you can pick out the commonalities. So if you've learned anything from this podcast, it's: mix up your pizza orders.

That's the takeaway! All you need to do is mix up your pizza orders and it's all going to be fine. Just pick something different — maybe every two weeks pick a different one.

Pick the same rotation every time and you've not helped yourself! But yeah, it's a real thing. You could anonymise, say, taxi journeys — if you're one of the ride-sharing apps of the world, sure, you could remove the names. But if you've got the pickup and drop-off points, you can probably make a good guess at who those people are, and how regularly they make those trips, and all the rest of it. And if you can map that against, say, flight details, then you can match the location they got picked up from against a flight time — and all of a sudden you're figuring out that that person is probably getting on a flight, and probably going to one of these destinations, because you picked them up from a given point and took them to an airport.
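
The correlation attack Chris describes can be sketched in a few lines. The data here is entirely made up for illustration — the point is just that "anonymised" drop-offs plus a second, public data set (a flight schedule) are enough to start inferring things:

```python
# Hypothetical, made-up data: an "anonymised" trip log and a public
# flight schedule. Names are gone, but correlation still leaks intent.
trips = [  # (rider_pseudonym, dropoff_location, dropoff_hour)
    ("rider_7f3a", "JFK Terminal 4", 6),
    ("rider_7f3a", "JFK Terminal 4", 14),
]
flights = [  # (terminal, departure_hour, destination)
    ("JFK Terminal 4", 8, "SFO"),
    ("JFK Terminal 4", 16, "LHR"),
]

def likely_destinations(trips, flights, window=3):
    """Guess where a rider is flying by matching airport drop-offs to
    flights departing within `window` hours of the drop-off."""
    guesses = {}
    for rider, dropoff, hour in trips:
        for terminal, dep, dest in flights:
            if terminal == dropoff and 0 <= dep - hour <= window:
                guesses.setdefault(rider, []).append(dest)
    return guesses

print(likely_destinations(trips, flights))
# {'rider_7f3a': ['SFO', 'LHR']}
```

Removing the names did nothing here: the pseudonym plus a second data point was enough, which is exactly the pizza-shop effect.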

You could do that with Uber data, basically — I guess if you had that data you could probably work out who the officials were, where they'd come from, where you picked them up. Other ride-sharing apps are available! Not that I'm proposing anybody does that, but yeah.

Pick any of those data sets: as soon as you've got another data point, another collection point, you can quite rapidly de-anonymise that data and start figuring things out and inferring stuff. And that's the big concern, particularly when you start thinking about different ends of the business and you're putting all this data in one place. That could be your customer database — which includes all the personal addresses and things like that — and their card details. You don't necessarily need to have them in the same place, but there'll be some common references between the two. And if you've got both, you've got enough to do a cardholder-not-present transaction somewhere, because you've got their address, the full card number and CVV, and potentially more data about them to get further through verification processes or whatever else.

So what would you suggest then — splitting the data? And what do you do if your anonymisation isn't strong enough? How far do you go with it — to the point where you totally redact the usefulness of the data? Maybe you've got some algorithmic thing running over the top so you can still properly test your application against it — but not so anonymised that it's lost any sense of meaning at all for the value of what you're engineering?

Yeah — it's about isolating systems by their different use cases. You don't necessarily need to bridge them and have a single thing that knows about both the customer data and the card data. PCI card details are their own thing, really — there are rules around how you handle them and keep them separate — so that's not necessarily the best example. But even keeping all of your personal data in one location might be fine. It all depends on: what's the cost to the business if something goes wrong? Whether that's the confidentiality or the integrity of it — someone leaks it, or someone changes something. Someone changes a customer's home address and you start shipping their regular order somewhere else: what's the business cost? When are you going to find out it's gone wrong, and how much does it cost you? And is it therefore proportionate to spend time and effort mitigating that potential event — whether it's someone internally changing it, or an off-by-one-type lookup, a mis-reference between two things? I mean, how often has a mail merge between two data sets gone wrong? Think of the number of times you get post with the wrong name on it, or emails addressed to the wrong person. It's a real thing: when you're bridging and doing mail merges between stuff, it obviously goes wrong. So what's the consequence of that, and what checks do you put in place to mitigate those problems — in a way that's proportionate to the actual downside if you don't address them? Ultimately, going back to separating things out: it's about the use case for when you're going to consume it.
And it often makes sense to have this firehose of data coming in, with events you've pushed somewhere, so at least you've got the ability to track back and replay history. Your retention on that is either your own appetite and budget, or the compliance reasons for keeping it — so if you're a bank, then years, if not forever.
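
The firehose-plus-replay idea can be sketched as an append-only log. This is a toy illustration, not anything from the platform discussed — in practice you'd reach for something like Kafka or a cloud vendor's event stream:

```python
import time

class EventLog:
    """Toy append-only event log with replay and time-based retention."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self._events = []  # (timestamp, payload) pairs, append-only

    def append(self, payload, now=None):
        """Record an event; `now` is injectable for testing."""
        self._events.append((now if now is not None else time.time(), payload))

    def replay(self, since=0.0):
        """Re-read history from a point in time, e.g. to rebuild a view."""
        return [p for ts, p in self._events if ts >= since]

    def expire(self, now=None):
        """Drop events older than the retention window (budget/compliance)."""
        now = now if now is not None else time.time()
        cutoff = now - self.retention_seconds
        self._events = [(ts, p) for ts, p in self._events if ts >= cutoff]

log = EventLog(retention_seconds=3600)
log.append({"sensor": "temp", "value": 21.5}, now=100.0)
log.append({"sensor": "temp", "value": 22.1}, now=200.0)
print(log.replay(since=150.0))  # [{'sensor': 'temp', 'value': 22.1}]
log.expire(now=4000.0)          # cutoff 400: both events dropped
print(log.replay())             # []
```

The design point mirrors the conversation: the stream is the system of record you can always replay, and retention is a dial you set from appetite or compliance rather than from the query layer.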

If you're a mom-and-pop shop, then maybe less so — or at least for most of it, anyway.

And yeah, the annoying answer is always "it depends", right?

It's always conditional, isn't it? Every business is going to be different, and the risk is going to be different, because your business is different — so what you do is too.

Yeah, absolutely. I mean, credit cards have an intrinsic value in their own right, like some assets, but personal data — depending on what it is, and also who your customers are — has a certain value too. If you've got a European customer base, then a breach becomes quite an expensive one.

Right, yeah.

Whereas elsewhere it's potentially less of a problem — in terms of financial penalties, at least; obviously you'll take a load of PR damage regardless. But in terms of actual head-on financial impact, you don't need to look further than British Airways, with the recent fines that came out.

It really is material — big time. That's a lot of money.

Yeah, absolutely. It's a big deal, and not something you can just brush off and say isn't a problem — you need to think about it. And the main thing that's important to think about is the value of the actual thing. Everyone talks about data being a valuable asset that you should look after — well, go on then: think about what the value in that thing actually is, and who it's valuable to. If it's only valuable to you, then you're probably not going to be the target of a breach. If you're collecting weather data, or, I don't know, you've hooked up a telescope and you're watching the stars — that data is going to be super important to you, but generally not worth anyone else's time, apart from someone trolling you by changing or deleting it. It's largely uninteresting to them. So the mitigations should be proportionate to what the actual data is, and who it's worth something to.

That's good. We've obviously been going an hour, so we're probably going to wrap up. But this has been really, really interesting — really interesting to hear the story as well. If you've got any questions for Chris, feel free to message us and we'll definitely pass them on. Can people reach you on Twitter or anything? — I've got a name on Twitter; I don't think I've ever tweeted from it, though. — Anyway, you can reach us through the podcast: tweet us at @cloud_unplugged, or email cloudunpluggedpodcast@gmail.com, and you can find us on YouTube. It's been really good speaking to you — really good to hear about platform X. Maybe we'll revisit this in the future, who knows. But yeah, thanks for your time. — And thanks for having me. — Cool, speak soon. Bye!

A very special thanks to Chris for sitting down with Jon and recording this episode. It's not often that I get to hear the episodes the way that you do, but I thoroughly enjoyed that one and think it was full of valuable insights. Jon and I are going to be at KubeCon North America in Los Angeles, California from October 11th to the 15th, and we would love to meet you in person. We might just maybe have something special planned for you podcast listeners, but you're going to have to tune in to future episodes to get more information on that. As always, please rate and review us on your favourite podcast app. You can tweet us at @cloud_unplugged or email us at cloudunpluggedpodcast@gmail.com. On YouTube we have episodes, transcripts and bonus content, and in the episode description there's a link to our Slack community, where you can join us and keep the conversation going. As always, thank you for listening, and we'll see you next time.

Transcribed by https://otter.ai