SIGWAC
ACL Special Interest Group

Web as Corpus

Researching the opportunities and limitations of using textual web data for linguistic analysis, language modelling, and knowledge representation.

About SIGWAC

The community for web corpus research

SIGWAC is the Special Interest Group of the Association for Computational Linguistics on Web as Corpus. We research the opportunities and limitations of using textual web data across linguistic and computational domains.

Given the ever-growing data needs of Large Language Models, web corpora have taken a central place in Natural Language Processing, Computational Linguistics, and Machine Learning. To reflect this, SIGWAC has organised its topics of interest across technical, legal, and societal dimensions.

We promote interest in the web as both a source of linguistic data and an object of study in its own right, providing members of the ACL with a means to exchange research developments and news.

Constitution of ACL SIGWAC

Objectives

  • Build a community around web-as-corpus research
  • Support and promote information exchange and the dissemination of results and best practices
  • Organise workshops, hackathons, and shared tasks
  • Sponsor timely and worthwhile meetings and workshops on the web as corpus

Stay in the loop

Sign up to the mailing list to receive news, announcements, and calls for participation from the SIGWAC community.

Scope

Topics of interest

SIGWAC's research spans three interconnected dimensions as web corpora become central to modern NLP and ML.

Technical

  • Filtering strategies for web data in LLM pre-training
  • Impact of web data in the pre-training data mix of LLMs
  • Crawling and ranking
  • Construction of web graphs
  • Language identification, multilinguality, and web as corpus for low-resource languages
  • Web indexing, information retrieval, and LLM application in document representations
  • Semantic web and automatic annotation of multilingual web data

Societal

  • Socio-linguistic studies of web data
  • Web-graph as a tool for corpus exploration in a multidisciplinary setting
  • Study of bias and toxicity in web corpora
  • Study of illegal content prevalence in web corpora
  • Web corpora as a means to promote multilingualism and multiculturalism

Leadership

Officers

  • Nikola Ljubešić, Co-Secretary
  • Benoît Sagot, Co-Secretary
  • Veronika Laippala, Co-President
  • Pedro Ortiz Suarez, Co-President

Resources

Corpora & tools

History

Workshops & meetings

The WAC workshop series has run since 2005, co-located with major NLP and linguistics conferences around the world.

2020 WAC-XII LREC 2020, Marseille, 16 May (cancelled; proceedings published)
2017 WAC-XI Corpus Linguistics 2017, Birmingham, 24–27 July
2016 WAC-X ACL 2016, Berlin, 12 August
2015 WAC@eLex eLex, Herstmonceux Castle, UK, 10 August
2014 WAC9 EACL 2014, Gothenburg, 26–27 April
2013 WAC8 Corpus Linguistics 2013, Lancaster, 22 July
2012 WAC7 WWW12, Lyon, 17 April
2011 BUCC ACL 2011, Portland, Oregon, 24 June
2010 WAC6 NAACL-HLT, Los Angeles, 5 June
2009 WAC5 SPLN, San Sebastián, Basque Country, 7 September
2008 WAC4 LREC, Marrakech, 1 June
2007 WAC3 Louvain-la-Neuve, Belgium, 15–16 September
2006 WAC2 EACL, Trento, Italy, April
2005 WAC1 Corpus Linguistics, Birmingham, July