ACL SIGWAC: Special Interest Group on Web as Corpus

About SIGWAC

The community for web corpus research

SIGWAC is the Special Interest Group of the Association for Computational Linguistics on Web as Corpus. We research the opportunities and limitations of using textual web data across linguistic and computational domains.

Given the ever-growing data needs of Large Language Models, web corpora have taken a central place in Natural Language Processing, Computational Linguistics, and Machine Learning. SIGWAC has organised its topics of interest along technical, legal, and societal dimensions to reflect this.

We promote interest in the web as both a source of linguistic data and an object of study in its own right, providing members of the ACL with a means to exchange research developments and news.

Constitution of ACL SIGWAC

Objectives

01 Build a community around web-as-corpus research
02 Support and promote information exchange and the dissemination of results and best practices
03 Organise workshops, hackathons, and shared tasks
04 Sponsor meetings and workshops on the web as corpus that are timely and worthwhile

Stay in the loop

Subscribe to the mailing list

Scope

Topics of interest

SIGWAC's research spans three interconnected dimensions as web corpora become central to modern NLP and ML.

Technical

Filtering strategies for web data in LLM pre-training
Impact of web data in the pre-training data mix of LLMs
Crawling and ranking
Construction of web graphs
Language identification, multilinguality, and web as corpus for low-resource languages
Web indexing, information retrieval, and LLM application in document representations
Semantic web and automatic annotation of multilingual web data

Legal

Intellectual property and licensing of web data
Robots Exclusion Protocol and other opt-out methods for AI training
Privacy preservation in web corpora, automatic PII detection and redaction
Study and application of the TDM directive in the EU
Study and application of the AI Act in the EU
Scope of data usage

Societal

Socio-linguistic studies of web data
Web-graph as a tool for corpus exploration in a multidisciplinary setting
Study of bias and toxicity in web corpora
Study of illegal content prevalence in web corpora
Web corpora as a means to promote multilingualism and multiculturalism

Workshops & meetings

2026	WaC-13 (upcoming)	EMNLP 2026, Budapest, 24–29 October. Call for papers open.
2020	WAC-XII (cancelled)	LREC 2020, Marseille, 16 May. Proceedings published.
2017	WAC-XI	Corpus Linguistics 2017, Birmingham, 24–27 July
2016	WAC-X	ACL 2016, Berlin, 12 August
2015	WAC@eLex	eLex, Herstmonceux Castle, UK, 10 August
2014	WAC9	EACL 2014, Gothenburg, 26–27 April
2013	WAC8	Corpus Linguistics 2013, Lancaster, 22 July
2012	WAC7	WWW12, Lyon, 17 April
2011	BUCC	ACL 2011, Portland, Oregon, 24 June
2010	WAC6	NAACL-HLT, Los Angeles, 5 June
2009	WAC5	SPLN, San Sebastián, Basque Country, 7 September
2008	WAC4	LREC, Marrakech, 1 June
2007	WAC3	Louvain-la-Neuve, Belgium, 15–16 September
2006	WAC2	EACL, Trento, Italy, April
2005	WAC1	Corpus Linguistics, Birmingham, July

Officers

Nikola Ljubešić

Benoît Sagot

Veronika Laippala

Pedro Ortiz Suarez
