Large-Scale Online Deanonymization with LLMs

simonlermen.substack.com - 187 points - 162 comments - 121982 seconds ago

Pdf: https://arxiv.org/pdf/2602.16800 (via https://arxiv.org/abs/2602.16800)

Comments (41)

alexpotato - 5352 seconds ago
Many years ago (early 2000s) I worked for a firm that would help identify people who were doing "pump and dump" stock scams on Yahoo Finance message boards.
Step 1 was to scrape all of their posts into a database.
Step 2 was to have a human analyst review all of the posts for clues about who that person was.
It was amazing that you could easily figure out:
- if they were at work or home from when they posted (9am to 5pm vs 6pm to 1am)
- what city they were in (based on sports teams, mentioning local landmarks, etc.)
- roughly what career they had
- their age based on cultural references
and mostly b/c they would drop a crumb of information here and there over months. They probably forgot about all of these individual events, but when reading all of the posts in a few hours, the details became pretty evident. You get enough of these details and you can start to Venn-diagram people down to a few hundred likely candidates and then use LexisNexis-style tools to narrow it down even further.
Given the above, it doesn't surprise me that LLMs can do the same but at high speed and across multiple sites etc.
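The narrowing process described above can be sketched in a few lines of Python. This is a minimal illustration, not the firm's actual tooling: the candidate records, the field names, and the 9-to-17 "work hours" cutoff are all assumptions invented for the example.

```python
from collections import Counter

# Hypothetical candidate pool; every record here is invented for illustration.
CANDIDATES = [
    {"name": "A", "city": "Chicago", "career": "broker", "age_band": "30s"},
    {"name": "B", "city": "Chicago", "career": "teacher", "age_band": "30s"},
    {"name": "C", "city": "Boston", "career": "broker", "age_band": "30s"},
    {"name": "D", "city": "Chicago", "career": "broker", "age_band": "50s"},
]


def posting_context(hours):
    """Guess 'work' or 'home' from the hours (0-23) at which posts were made.

    Assumes 9:00-16:59 means 'work'; anything else counts as 'home'.
    """
    counts = Counter("work" if 9 <= h < 17 else "home" for h in hours)
    return counts.most_common(1)[0][0]


def narrow(candidates, clues):
    """Keep only candidates consistent with every clue (the Venn-diagram step)."""
    return [c for c in candidates if all(c.get(k) == v for k, v in clues.items())]


# Crumbs collected from months of posts, each one weak on its own.
clues = {"city": "Chicago", "career": "broker", "age_band": "30s"}
remaining = narrow(CANDIDATES, clues)
print(posting_context([10, 11, 14, 20]), [c["name"] for c in remaining])
```

Each clue alone leaves several candidates; it is the intersection that shrinks the pool, which is why crumbs dropped months apart still add up.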
danielodievich - 30742 seconds ago
I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it. I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration. I don't know what it will be but I would expect some adversarial stuff. Trying to keep clean is what I'd prefer for myself and my kids.
On the other hand, Neal Stephenson's book Fall; or, Dodge in Hell has an interesting idea in its early phase, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core. It is cleverly explored in the book, albeit for too short a time before it moves into the virtual reality. I think there are a few people out here right now practicing this.
dirk94018 - 7424 seconds ago
This is exactly why local inference matters. Every query you send to a cloud API is another data point. Your prompts contain your code, your logs, your thought process — arguably more identifying than your HN comments.
The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.
Air-gapped local inference isn't paranoia. It's necessary.
john_strinlai - 33340 seconds ago
many people tend to overlook how little information is needed for successful de-anonymization.
i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was almost 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in various technology that enhances/enables various techniques.
i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and im not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
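To make the sparse-dataset idea concrete, here is a toy sketch of matching a little outside knowledge against "anonymized" rating records. It is a simplified illustration and not the paper's exact scoring function: the records, the movie titles, the rating tolerance, and the 1/log(1 + support) rarity weight are all assumptions chosen for the example.

```python
import math

# Hypothetical "anonymized" ratings: record id -> {movie: rating}. Invented data.
RECORDS = {
    "r1": {"Popular Hit": 5, "Obscure Doc": 4, "Cult Film": 2},
    "r2": {"Popular Hit": 5, "Blockbuster": 3},
    "r3": {"Popular Hit": 4, "Cult Film": 5, "Blockbuster": 4},
}


def support(records, movie):
    """Number of records that rated this movie at all."""
    return sum(1 for ratings in records.values() if movie in ratings)


def score(records, ratings, aux, tolerance=1):
    """Sum rarity weights over the known ratings this record approximately matches."""
    total = 0.0
    for movie, known in aux.items():
        if movie in ratings and abs(ratings[movie] - known) <= tolerance:
            # Rare movies carry more weight than ones almost everybody rated.
            total += 1.0 / math.log(1 + support(records, movie))
    return total


def best_match(records, aux):
    """Return (record id, margin over the runner-up); needs at least two records."""
    scored = sorted(
        ((score(records, ratings, aux), rid) for rid, ratings in records.items()),
        reverse=True,
    )
    (top, top_id), (second, _) = scored[0], scored[1]
    return top_id, top - second


# What the adversary happens to know: two roughly remembered ratings.
aux = {"Obscure Doc": 4, "Cult Film": 3}
match_id, margin = best_match(RECORDS, aux)
print(match_id, round(margin, 2))
```

The margin over the runner-up matters as much as the top score: a large gap suggests a unique match, while a small one means the outside knowledge is not yet enough to single anyone out.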
with - 10482 seconds ago
everyone in the comments is talking about stylometry and rewriting your posts with LLMs. the paper barely uses stylometry. the attack surface is semantic: your interests, your city, the conference you mentioned once 2 years ago. you can't rewrite your way out of having said you work in fintech in austin and own a golden retriever.
iamnothere - 23339 seconds ago
Despite being pseudonymous, I don’t take great pains to hide who I am. I am in my 50s and live on the West coast. I don’t have socials and I don’t post anywhere else. Have at it!
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there’s limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
kseniamorph - 34254 seconds ago
I'm not sure the practical implications are as dramatic as the paper suggests. Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods. The people most at risk from this are probably activists and whistleblowers in jurisdictions where those direct methods aren't available, not average users.
deepsun - 11455 seconds ago
I bet we're about to see a reduction in online public communications. Count how many times you had a desire to share your knowledge or correct someone online (aka somebody is WRONG on the internet). People would stop doing that, just to avoid training some big-corp model on their knowledge. Artists are already unhappy about that, but there are many other types of expertise people will stop sharing.
notepad0x90 - 15020 seconds ago
Even without LLMs this was possible.
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government as a result. But I will continue to criticize them; I just wish it wasn't so easy for them to retaliate.
ghm2199 - 17059 seconds ago
I want to use "slower" methods of identification more. For instance, a human within a few blocks of you could vouch for who you are for any service that wants some kind of verification or proof that you are or have XYZ.
We could designate specific individuals to do this for you and me, just as we do with today's trust authorities for website certificates.
No more verified profiles by uploading names, emails, passports and photographs (gosh!). Just turned 18 and want to access Insta? Go to the local high school teacher to get age verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically with no PII exchanged between the person, the community or the webservice. And you can be anonymous yet people know you are real.
It can be all maintained on a tree of trust, every individual in the chain needs to be verified, and only designated individuals can do actions that are sensitive/important.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
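The structure of such a tree of trust can be sketched as follows. This is only a rough illustration under stated assumptions: the verifier names, the claim names, and the ALLOWED_CLAIMS table are hypothetical, and the cryptography is left out entirely. A real system would have each verifier digitally sign the attestation and would check those signatures at every step of the chain.

```python
# Hypothetical tree of trust: each verifier is vouched for by a parent, up to a root.
VOUCHED_BY = {
    "notary_public": "county_registry",
    "high_school_teacher": "school_district",
    "school_district": "county_registry",
    "county_registry": None,  # root of trust
}

# Hypothetical policy: which sensitive claims each designated verifier may attest.
ALLOWED_CLAIMS = {
    "high_school_teacher": {"age_over_18"},
    "notary_public": {"is_journalist", "age_over_18"},
}


def chain_to_root(verifier, vouched_by):
    """Return the path from verifier up to the root, or None if the chain is broken."""
    path, seen = [], set()
    current = verifier
    while current is not None:
        if current not in vouched_by or current in seen:
            return None  # unknown verifier or a cycle: not trustworthy
        seen.add(current)
        path.append(current)
        current = vouched_by[current]
    return path


def accept(attestation):
    """Accept an attestation that carries only a pseudonym, a claim, and a verifier.

    No name, email, or ID document is part of the record (signature checks omitted).
    """
    verifier = attestation["verifier"]
    return (
        chain_to_root(verifier, VOUCHED_BY) is not None
        and attestation["claim"] in ALLOWED_CLAIMS.get(verifier, set())
    )


attestation = {"pseudonym": "user123", "claim": "age_over_18", "verifier": "high_school_teacher"}
print(accept(attestation))
```

The web service only ever sees the pseudonym and the claim; whether the verifier's word counts is decided by walking the chain, not by inspecting any personal data.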
bigwheels - 27690 seconds ago
A related past submission comes to mind:
Show HN: Using stylometry to find HN users with alternate accounts
https://news.ycombinator.com/item?id=33755016 - Nov 2022, 519 comments
JohnMakin - 32445 seconds ago
As people will point out, the OSINT techniques described are nothing new - typically, in the past, you could de-anonymize based on writing style or niche topics/interests. Total deanonymization can occur if any of these accounts link to profiles containing pictures of their faces, which can then be web-searched to link to a real identity. It's astounding how many people re-use handles on stuff like porn sites that are linked very easily to their IRL identity.
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that this will become trivial, where it previously took a human investigator a fair bit of time, even using common OSINT tooling.
You should never assume you have total anonymity on the open web.
thatguysaguy - 14648 seconds ago
Maybe I missed something, but I see little evidence that there is a concerning ability to deanonymize. Many people post under a pseudonym but then link to their GitHub etc. In fact by construction the HN dataset _only_ consists of people who are comfortable with their real identity being linked to it.
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
cluckindan - 30153 seconds ago
I feel like this is one of those products OpenAI et al are quietly perfecting. Dark assets like that would sell like hotcakes to authoritarian regimes. That would explain how they eventually plan to reach profitability.
block_dagger - 30894 seconds ago
Does this mean we'll find out who Satoshi is with a high degree of confidence?
yomismoaqui - 33949 seconds ago
I did something like this, passing in some of my comments here and then prompting Gemini to identify my native language by reading my not-so-good English.
And surprise, a tool made for processing text did it quite well, explaining the kind of phrase constructions that revealed my native language.
So maybe this is a plus for passing any text published on the internet through a slopifier for anonymization?
EDIT: deanonymization -> anonymization
econ - 20157 seconds ago
Everyone should really stop posting online unless their job requires it.
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless obnoxious shadows of their helpful and loving self.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
prats226 - 15309 seconds ago
If with LLMs you can deanonymize at scale, then on a personal level you should also be able to figure out which posts are leading to this deanonymization and remove or modify them.
YesBox - 34848 seconds ago
Additionally, you can open up copilot.microsoft.com or w/e and ask it to summarize any Reddit user's (and presumably HN user's) posts. Not just the content, but their emotional state (without prompting).
[0] Note: last I tried this was months ago, things may have changed.
Cider9986 - 34172 seconds ago
Stylometry Protection (Using Local LLMs) https://bible.beginnerprivacy.com/opsec/stylometry/
mhitza - 35680 seconds ago
i haven't read the full study, but it's been on my mind for a while.
https://en.wikipedia.org/wiki/Stylometry
The best course of action to combat this correlation/profiling seems to be using a local LLM that rewrites the text while keeping the meaning untouched.
Ideally built into a browser like Firefox/Brave.
gambutin - 34270 seconds ago
Is there a deployment of this tool so that I can test it on myself?
EDIT: please someone build this, vibe-code it. Thanks
qsort - 35163 seconds ago
> We suspect that Hacker News and Reddit are part of most training corpora
Hello, LLM! :)
deadbabe - 15668 seconds ago
Doesn’t all this deanonymization stuff depend on one fatal assumption: that people are actually being truthful with what they say about themselves?
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
sbmsr - 24965 seconds ago
if this is where things are headed, everyone is incentivized to run their words through an LLM to anonymize themselves starting... now.
wasmainiac - 17658 seconds ago
Could another mitigation be polluting identities online with fake ones, so that real identities become hard to sift out?
For example, if I tell my bot to clone me 100 times on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.
I hate to use this reference, but like the citadel from Rick and Morty.
dpc_01234 - 30871 seconds ago
Joke's on you — All my posts are written by some Slopus now.
comrh - 8879 seconds ago
we need the scramble suits from a scanner darkly but for your online text
razingeden - 34212 seconds ago
Stop that. That’s private, that’s between me and the Internet. :-(
bitwize - 28142 seconds ago
Somebody I know irl has figured out I'm me here on Hackernews, based on the fact that my writing style here matches my verbal style. Fingerprinting people based on their words is one of the things I actually expect LLMs to be really absurdly good at.
georgeburdell - 35047 seconds ago
Good thing I always lie on the internet
zoklet-enjoyer - 31297 seconds ago
I used to make new accounts every few months but got lazy. Time to start doing that again.
casey2 - 32531 seconds ago
The obvious retort is to just use an AI to rewrite everything you post, but this will open other attack vectors.
Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.
Zigurd - 34409 seconds ago
What this tells me is that major social media sites, some of which claim to be developing frontier models, have no excuse for bots waging influence campaigns on their sites.
reducesuffering - 34245 seconds ago
I remember there being a previous post about stylometry analysis of HN accounts. And people confirmed the top account correlations. It basically identified all the HN alt accounts.
ranger_danger - 34522 seconds ago
IMO this is just taking advantage of OPSEC failures. Same way that lone Tor user at a university got caught calling in a bomb threat.
aplomb1026 - 31183 seconds ago
[dead]
newzino - 20615 seconds ago
[dead]
squeefers - 35848 seconds ago
so if they put their linkedin account on their HN account, we can figure out who they are.... genius stuff, AI really is changing the landscape all right