Query for records that possibly match other wildcarded records

cnstarz · May 17, 2016

I have this database:

As you can see, both of jtb_Affected_Target's foreign keys are linked to tbl_Targets.Target_ID. This allows me to relate targets from the same table as Children or Parents to other targets.

Here is frm_Target (based on tbl_Targets):

The subforms allow me to link targets together as either Child or Parent targets. As you can see above, "*@gmail.com" has two child targets: email1@gmail.com and email2@gmail.com. Navigating to those records in frm_Target will show *@gmail.com as a Parent Target:

I also have a target, email3@gmail.com, that does not have a parent target:

It would make sense to make it a Child Target of *@gmail.com, just like email1@gmail.com and email2@gmail.com. But, before I can get to that point, I would need a query that:

IDENTIFIES targets that do not currently have any Parent targets, and
MATCHES them to potential Parent target candidates.

#1 is easy -- I have a query ("qry_Targets_Orphans"):

Code:

SELECT 
   tbl_Targets.*
FROM 
   tbl_Targets 
      LEFT JOIN jtb_Affected_Targets 
      ON tbl_Targets.Target_ID = jtb_Affected_Targets.Child_Target_ID
WHERE 
   jtb_Affected_Targets.Child_Target_ID Is Null

Which yields the following results:

You'll notice in the above screenshot, there are also targets *@*email.com and test.123@subdomain.email.com... these should also be linked together as Parent-Child. How do I modify this query so that it can find targets where the names are LIKE each other? (email3@gmail.com is LIKE *@gmail.com; test.123@subdomain.email.com is LIKE *@*email.com) Ideally, there would be a third column that shows the matched, potential Parent Target. I attached a copy of the database for you to see.

plog · May 17, 2016

I don't think your tables are structured properly. You need to store discrete pieces of data discretely. That means, each signficant piece of data goes into its own field.

In common use, an email is one piece of information (johnsmith@yahoo.com). However, in your database it is not, it is 2 pieces of data (username and domain). You need to store it as such. You need one field for the user name (johnsmith) and one field for the domain (yahoo.com). When you do that, your issue becomes simple. In fact, you no longer need to try and hack your way into it, in fact, one table will provide you with the information you need, not 2 tables that require an awkward matching query.

cnstarz · May 17, 2016

In this database, an email address is one piece of information. They are not company email addresses, they are external, 3rd party email addresses that we are wanting to implement blocks on (prevent inbound/outbound emails to/from said email addresses).

Sometimes we'll need to block individual email addresses (IE: user1@baddomain.com) and sometimes we'll need to block entire domains (IE: *@*baddomain.com). So, if you block the entire *@*baddomain.com (which could be a Parent Target), then you also block user1@baddomain.com, user2@baddomain.com, etc, which could be Child Targets.

Thus, I need to find potential Parent targets for targets that already exist in the database.

jdraw · May 17, 2016

Tell us about the logic involved in determining a blocked email or domain.
It seems you have identified HOW you will do something, but you really haven't told readers WHAT that something is (in my view).

plog · May 17, 2016

I love it when people expicitly tell me I am wrong:

In this database, an email address is one piece of information.

And then go on to explain that I am correct:

...sometimes we'll need to block entire domains

In your system, an email address contains 2 pieces of information. You need to store it as such.

Then to accomplish what you want, you would have 2 tables; one to block domains and one to block specific email addresses. Actually, you could probably get away with 1 blocked table, just use an * in the user name to designate the entire domain is verboten.

Minty · May 17, 2016

I'm puzzled - if you want to block a whole domain - why would you be bothered that you might already have fred@about2BblockedWholeDomain in your blocked list or Vice versa?

If you have already blocked the whole domain why would you be bothered about needing to block fred@AlreadyBlockedWholeDomain as he won't be getting through anyway.

cnstarz · May 17, 2016

jdraw said:
Tell us about the logic involved in determining a blocked email or domain.
It seems you have identified HOW you will do something, but you really haven't told readers WHAT that something is (in my view).

Okay, let me run through a scenario. We will receive intel that there is a spearphishing campaign where actors from Country X are sending malicious emails with the following attributes:

Subject: Your USAA Has Expired!
Sender(s): bademail1@gmail.com
bademail2@gmail.com
bademail3@baddomain.com
bademail4@baddomain.com
bademail5@baddomain.com
bademail6@baddomain2.com
Attachment(s): usaa confirmation open.pdf

We will want to block those email addresses from emailing us. Obviously we don't want to block *@*gmail.com because our network would not be able to send/receive emails from gmail.com. So for those, we're stuck blocking those individual email addresses. We can block the entire baddomain.com domain though (*@*baddomain.com). We are unsure the effects of blocking *@*baddomain2.com because 1) only 1 sender utilized that domain, and 2) we don't know if baddomain2.com is really bad. So, we would just block the individual email address, bademail6@baddomain2.com. So, there are 4 targets we've identified for blocking:

bademail1@gmail.com
bademail2@gmail.com
*@*baddomain.com
bademail6@baddomain2.com

Now, let's assume that two weeks later we get some more intel for another spearphishing campaign, this time there are 100 different senders and they are all originating from @baddomain2.com. At this point we will just block *@*baddomain2.com. But, what about bademail6@baddomain2.com that we targeted earlier? It should be a child target, however, it was created 2 weeks before the parent *@*baddomain2.com was even identified. Hence why I'm here now, looking for a query to identify targets that aren't children that could have potential parents.

plog said:
I love it when people expicitly tell me I am wrong:

And then go on to explain that I am correct:

You don't have to be rude. You were not correct; What I said was correct: each Target is a single piece of information in my database. It's not broken up into two pieces consisting of username and domains because again, as I stated in my previous post, these targets do not represent our company's email addresses. They represent 3rd party email addresses that we want to block. We do not want to block all domains that come in because some of those domains are necessary to keep communications with (like @gmail.com).

Minty said:
I'm puzzled - if you want to block a whole domain - why would you be bothered that you might already have fred@about2BblockedWholeDomain in your blocked list or Vice versa?

If you have already blocked the whole domain why would you be bothered about needing to block fred@AlreadyBlockedWholeDomain as he won't be getting through anyway.

Depending on the person inputting the information, we may or may not block the entire domain. Generally we will block the entire domain. But if someone is just untrained or ignorant, he may just input the individual email address instead of the entire domain.

This query is really meant to catch what already exists currently in my database. I already have a method in place that, when they create a new target (stupidemail@baddomain.com) it'll already look to see if *@*baddomain.com already exists. If it does, then it creates the stupidemail@baddomain.com target and links it as a child target to *@*baddomain.com. I would utilize this method, however, it handles about 2-3 records per second (our network is slow and I've also included record auditing for each transaction) and my database currently has over 35k+ targets that I'd need to run this on (which would take forever). Hence my request for a more efficient query to do this on a larger scale.

plog · May 17, 2016

You were not correct; What I said was correct: each Target is a single piece of information in my database.

I tried to help--provided a solution and explained in detail how this system should operate. I'll drop out, but I eagerly await someone else's solution that accomplishes it in the manner you have fixed in your mind, and I reserve the right to come back and say 'I told you so'.

Minty · May 18, 2016

I think you might be better to keep a white-list of "domains we can't block" as this will be a much shorter list to check against, this would then only leave you with the names to check.
Also do you utilise any DNSRBL's checks on your mail servers, as this will remove a lot of potential spam pretty effectively.

Is your data stored in a Access back-end or SQL server? If the latter the query would be much better run as a view or stored procedure on the server.

cnstarz · May 18, 2016

Minty said:
I think you might be better to keep a white-list of "domains we can't block" as this will be a much shorter list to check against, this would then only leave you with the names to check.
Also do you utilise any DNSRBL's checks on your mail servers, as this will remove a lot of potential spam pretty effectively.

Is your data stored in a Access back-end or SQL server? If the latter the query would be much better run as a view or stored procedure on the server.

I agree, we need to start whitelisting, and I've been preaching that for the last year, but that decision is far above my pay grade. This is an enterprise network consisting of over 500k mailboxes. We do utilize black hole lists, as well as white lists and various other ACLs.

I appreciate the help and suggestions regarding the root of the problem, but I can't address nor fix the root of the problem, all I can do I is figure out an interim solution (this query) in the mean time until the "Powers That Be" make a decision.

I've experimented with a query utilizing EXISTS and IN in the criteria, but those queries have been incredibly slow -- I let them sit for about 5 minutes before I stopped them.

cnstarz · May 18, 2016

plog said:
I tried to help--provided a solution and explained in detail how this system should operate. I'll drop out, but I eagerly await someone else's solution that accomplishes it in the manner you have fixed in your mind, and I reserve the right to come back and say 'I told you so'.

I appreciate that you tried to help, but you were rude with your response. I'm starting to see that most of the readers here aren't interested in just assisting with what's presented, regardless of how absent-minded it may appear to them, and would rather focus on the other small details of the problem that are out of my control.

I just want to know if there's a way in Access to compare a record to see if it is LIKE another record that may be wild-carded (have asterisks) in the same table . IE: baduser@bademail.com is LIKE *@*bademail.com. Regardless of the root problem, being able to understand how to construct a query like this would increase my over-all knowledge of Access.

Minty · May 18, 2016

I assume that the email addresses are all indexed.
The problem you have is that you are using Like with the wildcard at the beginning of your search string, this forces access to literally go thorough every record character by character. Even SQL Server would not be quick with 500k records, doing a wildcard at the beginning of string search. I'm not surprised Access grinds to a halt.

As Plog suggested if these where two separate fields person / domain you could reduce your search "bandwidth" immediately by excluding emails addresses whose domains aren't the same.

jdraw · May 18, 2016

This link http://www.techonthenet.com/access/queries/like2007.php
will show you how to compare using LIKE.

As Minty said, a wildcard at the beginning will not use an index and will be slow relative to a compare with a non wildcard in first position.

cnstarz · May 18, 2016

Minty said:
I assume that the email addresses are all indexed.
The problem you have is that you are using Like with the wildcard at the beginning of your search string, this forces access to literally go thorough every record character by character. Even SQL Server would not be quick with 500k records, doing a wildcard at the beginning of string search. I'm not surprised Access grinds to a halt.

As Plog suggested if these where two separate fields person / domain you could reduce your search "bandwidth" immediately by excluding emails addresses whose domains aren't the same.

The Target_Name field is indexed, yes.

I understand why splitting up the field into 2 fields might make sense, but it's just not a possibility right now (I was already looking into it before I even made this thread).

cnstarz · May 18, 2016

jdraw said:
This link http://www.techonthenet.com/access/queries/like2007.php
will show you how to compare using LIKE.

As Minty said, a wildcard at the beginning will not use an index and will be slow relative to a compare with a non wildcard in first position.

Thanks jdraw, I know how to use LIKE, but I'm wanting to query ALL records where the Target Name is LIKE other Target Names in the same table. IE: baduser@baduser.com would be returned if there was a Target Name *@*baduser.com.

Something like:

Code:

LIKE [EXISTS (Select tbl_Targets.Target_Name From tbl_Targets)]

jdraw · May 18, 2016

Show all the SQL of the query you are developing.

cnstarz · May 18, 2016

This is what I have so far (qry_Targets_Orphans in the .zip I attached in the OP):

Code:

SELECT 
   tbl_Targets.*
FROM 
   tbl_Targets 
      LEFT JOIN jtb_Affected_Targets 
      ON tbl_Targets.Target_ID = jtb_Affected_Targets.Child_Target_ID
WHERE 
   jtb_Affected_Targets.Child_Target_ID Is Null

This so far just finds Targets that are not currently child targets ("orphan" targets). What I need to do now is compare the orphan targets to the remaining targets in the table to see if there are any possible matches (parent targets).

In the query, test.123@subdomain.email.com (Target_ID 11) is an orphan target. The query should be able to pair it with *@*email.com (Target_ID 10) because test.123@subdomain.email.com is LIKE *@*email.com.

jdraw · May 18, 2016

Need to see some sample records (real data).

However, I was helping someone recently and downloaded 200 records (voters) with random first and last names.

Here is a query, that compares VoterFirstNames and selects those where
the first 4 characters are the same. It also makes sure it is not the same record in the compare.That is, the record for compare is not the exact same record. This is done using 1 table, joined to itself.

SELECT v1.voterfirstname, v2.voterfirstname
FROM Voters as V1, Voters AS V2
where v2.voterfirstname like Mid(v1.voterfirstName,1,4) & "*"
and v2.voterfirstname <>v1.voterfirstName

The 4 checks the first 4 characters.
The v2.voterfirstname <>v1.voterfirstName says check where the names are not equal.

You could set this up to compare the first 6, or the first 5 etc.

Here is the sample based on first 4 chars

Code:

V1.voterfirstname	V2.voterfirstname
Juliana	Julie
Julio	Julie
Elissa	Elisa
Jazmyne	Jazmin
Alexander	Alexzander
Alexzander	Alexander
Kellie	Kelley
Elisa	Elissa
Shanel	Shanelle
Bernice	Berniece
Julie	Juliana
Julio	Juliana
Shanelle	Shanel
Berniece	Bernice
Julie	Julio
Juliana	Julio
Jazmin	Jazmyne
Kelley	Kellie

Note: In MsAccess the query could also be written as

Code:

SELECT   v1.voterfirstname, v2.voterfirstname
FROM Voters  as V1 inner join Voters AS V2
on  v2.voterfirstname like  Mid(v1.voterfirstName,1,4)   & "*"
and  v2.voterfirstname <>v1.voterfirstName

Query for records that possibly match other wildcarded records

cnstarz

Registered User.

Attachments

plog

Banishment Pending

cnstarz

Registered User.

jdraw

Super Moderator

plog

Banishment Pending

Minty

AWF VIP

cnstarz

Registered User.

plog

Banishment Pending

Minty

AWF VIP

cnstarz

Registered User.

cnstarz

Registered User.

Minty

AWF VIP

jdraw

Super Moderator

cnstarz

Registered User.

cnstarz

Registered User.

jdraw

Super Moderator

cnstarz

Registered User.

jdraw

Super Moderator

Similar threads

Users who are viewing this thread