GuardLabs Datasets

Public Datasets for Security and Anti-Fraud Research

As part of our commitment to improving the digital ecosystem, GuardLabs makes certain research datasets publicly available. These datasets are curated and anonymized from our internal security research, honeypots, and analytical models. They are intended for academic researchers, data scientists, and security developers who are building or testing their own anti-fraud and threat detection systems. The primary goal is to provide high-quality, real-world data to foster innovation and collaboration in the security community. These datasets are for research and educational purposes only; do not use them as live production blocklists without significant validation of your own.

Available Datasets and Formats

Our collection is updated periodically as new research is concluded. Current datasets include anonymized indicators related to malicious login attempts, comment spam campaigns, and characteristics of phishing URLs. For example, you might find a dataset of IP addresses with their associated failed login counts and user-agent strings, all fully anonymized to protect privacy. Each dataset is provided in common, easy-to-parse formats like CSV and JSONL, and is accompanied by a `README.md` file that describes the data schema, collection methodology, and any important caveats for its interpretation. This allows for easy integration into data analysis pipelines using tools like Python or R.
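As a rough illustration of how the CSV and JSONL formats can be read into the same record shape, here is a minimal Python sketch that uses only the standard library. The column names (`ip_token`, `failed_logins`, `user_agent`) and the sample rows are hypothetical, chosen to mirror the login-attempt example above; the actual schema for each dataset is the one described in its `README.md`.

```python
import csv
import io
import json

# Hypothetical sample rows. Real column names come from each dataset's README.md.
SAMPLE_CSV = """ip_token,failed_logins,user_agent
a1f3,42,curl/8.0
9bc2,3,Mozilla/5.0
"""

SAMPLE_JSONL = """{"ip_token": "a1f3", "failed_logins": 42, "user_agent": "curl/8.0"}
{"ip_token": "9bc2", "failed_logins": 3, "user_agent": "Mozilla/5.0"}
"""


def read_csv_records(stream):
    """Yield one dict per CSV row, converting the count column to int."""
    for row in csv.DictReader(stream):
        row["failed_logins"] = int(row["failed_logins"])
        yield row


def read_jsonl_records(stream):
    """Yield one dict per non-empty JSONL line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In practice the streams would be opened files, e.g. open(path, newline="").
csv_records = list(read_csv_records(io.StringIO(SAMPLE_CSV)))
jsonl_records = list(read_jsonl_records(io.StringIO(SAMPLE_JSONL)))
```

CSV values arrive as strings, so numeric columns need an explicit conversion, while JSONL preserves the types written by the producer. Normalizing both into plain dictionaries keeps the downstream analysis code independent of the file format.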

License and Usage Guidelines

All datasets published here are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This means you are free to share and adapt the data for any non-commercial purpose, as long as you provide appropriate attribution to GuardLabs. We explicitly prohibit the use of this data in a commercial product or service that directly resells or monetizes the data itself. The intended use cases are academic research, internal model training, and educational projects. Using this data to build a commercial threat intelligence feed, for example, is not permitted.

Frequently Asked Questions

How often are these datasets updated?

Our datasets are not real-time streams. They are published as static snapshots after a research project or data collection period is concluded. Updates may occur quarterly, semi-annually, or on an ad-hoc basis when we have a new, high-quality dataset to share. Each dataset is versioned and dated so you can track its provenance. We recommend checking this page periodically for new additions rather than building systems that assume frequent, automated updates. For real-time protection, consider our active care plans.

Can I use this data to block traffic on my production website?

We strongly advise against using these datasets directly as production blocklists. The data is provided for research and is retrospective. An IP address that was malicious three months ago may be benign today. Using historical data for live blocking can lead to a high rate of false positives, blocking legitimate users. The data is best used to train your own detection models, which can then be applied to live traffic with appropriate safeguards and decay logic.
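One possible form of the decay logic mentioned above is an exponential half-life applied to a historical indicator's weight. The sketch below is an assumption for illustration only, not a method GuardLabs prescribes: the function name `decayed_score`, the 30-day default half-life, and the idea of a per-indicator base score are all hypothetical.

```python
from datetime import date


def decayed_score(
    base_score: float,
    observed_on: date,
    today: date,
    half_life_days: float = 30.0,  # hypothetical default; tune against your own traffic
) -> float:
    """Halve the weight of a historical indicator for every half-life that has elapsed."""
    # Observations dated in the future are treated as age zero.
    age_days = max((today - observed_on).days, 0)
    return base_score * 0.5 ** (age_days / half_life_days)


# An indicator observed 60 days ago keeps a quarter of its original weight.
example = decayed_score(1.0, date(2025, 1, 1), date(2025, 3, 2))
```

A score decayed this way is better used as one input feature to a detection model than as a blocking decision in its own right, so stale indicators fade out gradually instead of blocking legitimate users.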

What is the source of this data?

The data is aggregated and anonymized from several sources, primarily our network of WordPress honeypots designed to attract and analyze malicious bot activity. We also incorporate findings from our security audit services and other internal research initiatives. All personally identifiable information (PII) is removed, and data points are aggregated to ensure complete anonymity before publication. We take great care to ensure our data collection and sharing practices are ethical and responsible.

How do I provide attribution when using the data?

If you use our datasets in a research paper, academic project, presentation, or non-commercial tool, we require a simple attribution. Please include a statement such as: "This work utilizes datasets provided by GuardLabs (guardlabs.online/datasets/)" in your publication's methodology or acknowledgments section. For online use, a link back to this page is sufficient. This helps other researchers find the source data and encourages a cycle of open research.

Are these datasets related to your other services?

Yes, in a cyclical way. The insights we gain from analyzing this data help us refine the tools and techniques used in our web audit and care plan services. In turn, our services provide us with broad, real-world data that, once anonymized and aggregated, can be used to create new datasets for the community. The datasets themselves, however, are a separate, non-commercial offering focused on supporting the wider research community.

WordPress Plugin CVE Trends 2026
