
Okay, so hi everyone. I'm Philip, I work as a DevSecOps engineer for Marlink, and today I'm going to talk to you about vector search in SOC environments. First we are going to review traditional search methods, then we are going to review some theory on vector search, and then try to investigate an attack scenario together using vector search.

So, traditionally, what we do in SOC environments is some kind of lexical search, which relies heavily on keyword matching and regular expressions, performs badly on raw and non-standardized data, and usually requires some data parsing. Also, when we try to search for stuff or detect stuff,
it's susceptible to small changes in attacker behavior; for instance, a difference between lowercase and uppercase letters can break our search or detection.

So let's see what vector search has to offer us. Vector search is supposed to enable us to search by meaning, which is semantic search. In this case the searched text, as well as the query, becomes a vector, which is created by a machine learning model. Then a match is made based on distance in n-dimensional vector space, and it's suitable to run on raw data, as we will see later in the examples.

So generally we can say that there are dense and sparse vectors. Just briefly: dense vectors are fixed length, while sparse vectors ditch
zero values, so they are more resource-efficient. The models that produce dense vectors need to be trained on data more specific to the domain we are using them in, and sparse vectors can generalize well without extra fine-tuning.

If we try to look at vectors: on the right side is a dense vector, just a bunch of values, and every one of these values tries to describe a relation to some feature of the data. On the left side you can see a sparse vector created by the Elastic ELSER model. As we can see, these are the
key-value pairs; we can try to recognize some terms and the weights assigned to them.

So what happens when we try to do a vector search? In the case of dense vector matching, the ultimate goal is to find the nearest neighboring vector to the query vector, and this picture represents that well. If this is the vector space, we can see that a search for an apple is going to be close in vector space to apple as a fruit and to Apple as a company, which is a good and a bad thing, because it's not like fruits are related to Google or to Apple as an IT company. But we'll see later in
an example exactly something like this case.

In the case of sparse vector matching, and this is specifically for the Elastic ELSER model, the input text and the query are passed through the sparse encoder model. It does some kind of term expansion and synonym expansion and assigns weights, and then, when you try to match, these weights go through a dot product, get added together, and you get the score. The highest-scoring document is returned at the top.

So let's get practical. I'm just going to show you briefly how easy it is to set this up in Elasticsearch. After you pick the model that
you want to use, you can use Eland's import hub model tool by Elastic to import the model into your Elasticsearch instance. After that you have to configure the right data mappings for dense vector or sparse vector, and then edit your ingest pipelines to process the input data and create vectors from the input logs.

Now, I created one attack scenario that we are going to investigate together. We get a call from the BSides Zagreb organizers, and they report that their threat intelligence, so their super secret file, leaked and was offered for sale on dark web forums. They provide us with CloudTrail and application load balancer logs, ingested in Elasticsearch.
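The setup steps mentioned a moment ago (model import, field mappings, ingest pipeline) might look roughly like the sketch below. It only builds the request bodies as Python dictionaries. The field names, the model IDs and the 768-dimension figure are assumptions made for illustration and are not taken from the talk.

```python
# Sketch only: request bodies for the Elasticsearch setup described above.
# The model import itself happens on the command line with Eland, e.g.:
#   eland_import_hub_model --url <cluster-url> --hub-model-id <hub-model-id> \
#       --task-type text_embedding --start
# (the placeholders are left unfilled on purpose).

DENSE_DIMS = 768  # assumption: output size of the dense embedding model


def build_index_mappings(dense_dims: int = DENSE_DIMS) -> dict:
    """Mappings with one dense and one sparse vector field (hypothetical names)."""
    return {
        "mappings": {
            "properties": {
                "message": {"type": "text"},  # the raw, unparsed log line
                "message_dense": {
                    "type": "dense_vector",
                    "dims": dense_dims,
                    "index": True,
                    "similarity": "cosine",
                },
                "message_sparse": {"type": "sparse_vector"},
            }
        }
    }


def build_ingest_pipeline(dense_model_id: str, sparse_model_id: str) -> dict:
    """Ingest pipeline that runs both models over the raw log line."""

    def inference(model_id: str, output_field: str) -> dict:
        return {
            "inference": {
                "model_id": model_id,
                "input_output": [
                    {"input_field": "message", "output_field": output_field}
                ],
            }
        }

    return {
        "description": "Create dense and sparse vectors from raw logs",
        "processors": [
            inference(dense_model_id, "message_dense"),
            inference(sparse_model_id, "message_sparse"),
        ],
    }
```

With the official Python client, bodies like these would typically be sent through `indices.create` and `ingest.put_pipeline`. Older Elasticsearch versions stored ELSER output in a `rank_features` field instead of `sparse_vector`, so the mapping depends on the version in use.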
So let's try to investigate this together.
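Before the walkthrough, the two matching schemes from the theory part can be illustrated with a small sketch: cosine similarity between dense vectors, and a dot product over the shared terms of two sparse vectors. This is a toy illustration under simplifying assumptions, not how Elasticsearch or ELSER implement scoring internally, and all function names are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dense matching: compare two fixed-length vectors by angle."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_dot(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Sparse matching: multiply weights of shared terms and add them up."""
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


def rank_sparse(query_weights: dict[str, float], docs: dict[str, dict[str, float]]) -> list[str]:
    """Return document IDs ordered from highest to lowest sparse score."""
    scored = [(sparse_dot(query_weights, weights), doc_id) for doc_id, weights in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]
```

Terms that appear on only one side contribute nothing to the sparse score, which is why zero values never need to be stored.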
So this is my Jupyter notebook, and the first thing here is just some connection initialization to Elasticsearch. As a first step, we can create a hypothesis that this file was leaked from an S3 bucket, which is not an uncommon thing. In the first example we use k-nearest-neighbor search with data that's processed with the SecureBERT model, which is actually a model that's trained on a bunch of data related to security. We will try to search freely, so we don't care about the format of the data or what's inside the logs; we will try to extract the meaning. The meaning is that someone got an object from the S3 bucket with the name BSides
Zagreb. If you look at the results, we get a bunch of GetObject calls to an S3 bucket, but it's the wrong bucket. All of these top 15, I think, logs are related to getting an object from the CloudTrail bucket. This is very common with dense vector matching: as we saw earlier, when searching for an apple we might get Apple as a company and also apple as a fruit, or maybe even bananas, because they are close in vector space. Here are some other queries that I also tried to use, but they performed badly. I would
not say that they performed badly, but the results were not correct.

So let's switch to the sparse vector model, in this case the Elastic ELSER model, and use exactly the same query. Then in the top three results we can see that the Lambda S3 user got an object from the bucket named BSides Zagreb, and he got the super secret text file. Okay, so we found the file and we found the way it was grabbed. We can also see some list object actions on the same bucket, but when we tried to find more stuff related to the Lambda S3 user, we couldn't find anything. This indicates that an attacker escalated their privileges to this user, knew specifically about
this bucket, and went straight to listing it and grabbing the file.

Okay, the name of the user that grabbed the file indicates that it is somehow related to a Lambda function. We know that Lambda functions can often leak credentials, either hard-coded in the code or through the environment variables, so let's try to freely search for whether someone tried to get a Lambda function. We say that the user agent is Botocore, because with this, as we saw here, we can catch someone using the AWS CLI. And we can see that in the top results
we again get these things related to the S3 bucket. They got a really high score, so they surpassed the actual GetFunction call, which maybe is related to the fact that the GetFunction event is not called exactly GetFunction but has a suffix with the year, month and date. But here we can see that the function named S3 retriever was grabbed via the AWS CLI from some IP address, and the user here is an AWS assumed role. From the ARN we can see that this is a role that's related to an EC2 instance, and the way to achieve this is to use the instance metadata
service. What we can get from here is this role, the IP address (but let's not stick with the IP address) and the instance ID. When we tried to find anything more related to this instance, we couldn't really find anything. Here we tried to get the metadata related to this S3 Lambda role, and actually CloudTrail doesn't catch this stuff at all.

So let's pivot to the application load balancer logs for more info. Here again we have a completely different format: the CloudTrail logs are in JSON format, and here we have something like space-delimited values. And again we are doing a query
on this raw data. I tried to freely query for "metadata IAM S3 Lambda", and here we get requests to the web app ALB that pass a request on to the instance metadata service; we can see that from the IP, "metadata", and then the part related to this role. We can also see a PUT request for getting the token for version two. So now we can confirm that this was exploited: an SSRF vulnerability was exploited. Let's try to confirm that this is actually connected to the instance that we identified earlier, and let's get back to the CloudTrail logs and search. I forgot
to say, here from the target group we can see the name of the target group of this load balancer. So if we search for the target group and the instance ID, what we find in the top result is a root user adding the instance ID that we saw earlier to this target group.

So that's it. I think we helped the guys from BSides. Let me get back to the presentation to draw some conclusions. We can see that vector search can be effectively used for security analytics. It can lower onboarding time with new types of logs: we didn't have to parse anything, and we didn't have to know or care, actually,
if it's JSON format or some other kind of format. As for sparse vectors, I'm not sure if I mentioned it, but we used sparse vector search all the way from the first example to the end. It brings better out-of-the-box results, because I didn't bother much with picking the right model and so on. Dense vector models need to be trained on more relevant data: despite this model being trained on some kind of security data, it obviously needed more specific training on, let's say, logs from CloudTrail. If we talk about using this for detection rules, we can say that it's actually sensitive to similar values, so
it might require a lot of tuning, maybe some kind of hybrid search, which Elasticsearch supports. For further research, I would like to see the impact of retrieval-augmented generation and semantic reranking on search accuracy. That's it, thank you. Any questions?

Okay, do we have any questions? Yeah, Ilia, do you have a question? Are you
sure? Yeah.

Okay, yeah, thank you for the presentation. It's a really cool idea, actually, to use vector search to search the logs. It kind of seems like the future of it, maybe for some other use cases, not just security. My question is: how do these specialized models for security compare to more generalized stuff, from OpenAI for example, with the embeddings and stuff like that? Have you maybe tried that?

Well, first of all, there are a lot of limitations with dense vector search, because it's more resource-intensive, and you can see
from the first example, and this is specifically in Elasticsearch, that you have to limit the maximum number of candidates per shard. If we get into Elasticsearch internals, data is divided into shards, and the maximum value that you can set for this option is 10,000, and it's 10,000 per shard. So if you have more data, this is going to be even more approximate; it's not going to pass through all of the values. Also, resource-wise and storage-wise, it asks for much more than sparse vector search. So for now I think that sparse vector search might be better, but I haven't tried GPT models or something like that.
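The per-shard candidate limit discussed in this answer corresponds to the `num_candidates` option of the Elasticsearch kNN search. The sketch below builds a dense kNN request body and, for contrast, a sparse `text_expansion` request body. The helper names, field names and model IDs are hypothetical, and the exact queries in the speaker's notebook are not shown in the talk.

```python
MAX_NUM_CANDIDATES = 10_000  # upper limit for num_candidates, applied per shard


def build_knn_search(field: str, model_id: str, text: str,
                     k: int = 15, num_candidates: int = MAX_NUM_CANDIDATES) -> dict:
    """Approximate dense kNN search; the query text is embedded server-side."""
    if not 0 < k <= num_candidates <= MAX_NUM_CANDIDATES:
        raise ValueError("need 0 < k <= num_candidates <= 10000")
    return {
        "knn": {
            "field": field,
            "query_vector_builder": {
                "text_embedding": {"model_id": model_id, "model_text": text}
            },
            "k": k,
            # Only this many candidates are examined on each shard, so on
            # large shards the result is an approximation.
            "num_candidates": num_candidates,
        },
        "size": k,
    }


def build_sparse_search(field: str, model_id: str, text: str, size: int = 15) -> dict:
    """Sparse search: the query text is expanded into weighted terms."""
    return {
        "query": {
            "text_expansion": {
                field: {"model_id": model_id, "model_text": text}
            }
        },
        "size": size,
    }
```

Either body could be passed to the search API with the same free-text question, which makes it easy to compare dense and sparse results side by side. Newer Elasticsearch releases also offer a `sparse_vector` query as the successor to `text_expansion`.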
Cool, thank you. Any other questions? Yeah.
Oh, it's on, okay. I wanted to ask about the resources required for the sparse vector search. Let's say you have like a terabyte of logs: is this going to run on my laptop, or do I need like a super cluster to do it?

Well, in this case, to do any kind of this stuff you need Elasticsearch machine learning nodes. I think the minimum requirement for it is 4 GB of RAM, and I think it all depends. This model is not very big, so I think it's doable locally. I didn't do it locally; I did it in the cloud.
So, any other hands? No? Okay, thank you, Philip, for this.