Strategic and Competitive Intelligence

This project is a strategic and competitive intelligence analysis of the coding and software development field, in relation to generative AI, through social media discourse analysis. Conducted as part of a university course, the research explores how conversations about coding, emerging technologies, and professional practices unfold across multiple digital platforms including Twitter, Dev.to, Reddit, and Stack Overflow.

The project addresses six research questions that provide insights into the current state and future direction of the software development industry. Through natural language processing, text mining, and data visualization techniques, the analysis uncovers patterns in how developers discuss innovations, evaluate companies, and adapt to changing work environments. The methodology combines quantitative analysis of large-scale datasets with qualitative interpretation of emerging trends, resulting in actionable intelligence for understanding the competitive landscape of technology companies and coding practices.

The deliverable includes an interactive R Shiny dashboard that enables users to explore the findings through dynamic visualizations, making the intelligence accessible and actionable for various stakeholders including technology companies, educators, and industry analysts.

This project is my first hands-on experience both in using R and in applying strategic and competitive intelligence methodologies to real-world data sources. I learned how to transform unstructured social media data into actionable business intelligence through systematic analysis and visualization.

One of the most significant learnings was mastering the entire cycle: from identifying relevant information sources to collecting, processing, analyzing, and disseminating findings. I developed proficiency in working with multiple APIs including Twitter, Dev.to, and Reddit, understanding their limitations and optimizing data collection strategies within rate limits and access constraints.

The project deepened my understanding of natural language processing techniques, particularly named entity recognition, sentiment analysis, and text classification using libraries like NLTK and spaCy. I learned to separate signal from noise in large datasets containing millions of tweets, identifying meaningful patterns while managing computational resources efficiently.

Building the R Shiny dashboard taught me the importance of data storytelling and user-centered design in intelligence reporting. I learned to translate complex analytical findings into visualizations including word clouds, treemaps, and comparative charts that enable stakeholders to derive insights quickly.

Perhaps most importantly, I learned the value of iterative refinement in intelligence gathering. The process of filtering company names, validating results through manual review, and continuously improving data quality demonstrated that competitive intelligence requires combining automated analysis with human judgment to produce reliable, actionable insights. This project reinforced that effective intelligence work is as much about critical thinking and domain knowledge as it is about technical skills.

The project was structured around six research questions that guide the analysis of coding discourse and technology trends:

What organizations are mentioned most often in the genAI for coding public discourse?

The data we gathered shows that the organizations mentioned most often in the genAI for coding public discourse are mostly very large tech companies. The Twitter and Reddit data differ slightly, but not meaningfully, since the most-cited organizations are always the same. The list also contains some interesting names that stand out from those one would expect to be closely tied to public discussions of generative AI (such as NVIDIA), as can be seen in the treemap above.

What innovative approaches or methodologies are emerging in the field of coding?

We found that Twitter discussions revolve mostly around AI technologies such as "chatgpt", along with ongoing trends such as web3, IoT, AR, and the metaverse. The presence of "python" and "data" points to an interest in data science. Reddit, by contrast, focuses more on specific companies such as Comcast and Google. Notably, AI and cybersecurity are prevalent, as shown by mentions of terms like NSA and WannaCry. The Dev.to data is more developer-centric, focusing on the tools and programming languages most used by professionals, as the frequent mentions of terms like LLM, JMX, and Node.js show. Across all datasets, AI and its derivatives are a central part of today's emerging technology landscape.

Which programming languages were used before and after ChatGPT?

This analysis gathers data from Dev.to and Stack Overflow. The above data represents the mentions that programming languages received in 2022 and 2023, before and after the release of ChatGPT. These mentions are generative AI agnostic. The analysis shows only small changes in the mentions of programming languages before and after the release of ChatGPT. According to the data, the pace of innovation in generative AI technology is higher than the pace of change in the adoption of programming languages. Limitations: ChatGPT was released in November 2022, so this analysis covers only two years, 2022 and 2023. Future studies could look at a wider time window.

How are discussions about coding influenced by trends in remote work, digital nomadism, and gig economy employment, from the point of view of coding professionals?

Based on the Dev.to data, opinions about remote working were predominantly positive, indicating a favorable disposition toward emerging digital-nomad work arrangements. The main sentiment terms described this arrangement as easy, new, and available. On Twitter, the predominant sentiments are neutral and positive, which confirms the outcome of the analysis of the Dev.to dataset. The topics most discussed in those conversations are visible in the treemap above.

Based on online conversations and reviews, users are most satisfied with Similarweb and Semrush when it comes to competitive intelligence tools and software solutions. This aligns with the findings of Lopes et al. (2023), who conducted a bibliometric analysis to assess competitive intelligence and business intelligence concepts. In terms of functionality, Similarweb is often praised for its market research capabilities, Sprout Social for its social listening features, Ahrefs for its SEO tools, and Semrush is frequently mentioned as a comprehensive all-in-one solution. This is consistent with the systematic literature review by Hatzijordanou et al. (2019) on competitor analysis, which identifies and compares major market measurements that help distinguish the services and goods of the competitors. As for ease of integration, the general consensus is that these tools should ideally be able to integrate with existing systems like Salesforce and provide workbench capabilities for consolidated analysis. However, the best tool ultimately depends on specific requirements and budget. It's always recommended to read user reviews before making a decision.

Sources

Lopes, B. de S., Amorim, V., Au-Yong-Oliveira, M., & Lima Rua, O. (2023). Competitive and Business Intelligence: A Bibliometric Analysis. In Quality Innovation and Sustainability (pp. 187–197). https://link.springer.com/chapter/10.1007/978-3-031-12914-8_15
Hatzijordanou, N., Bohn, N., & Terzidis, O. (2019). A systematic literature review on competitor analysis: status quo and start-up specifics. Management Review Quarterly, 69, 415–458. https://link.springer.com/article/10.1007/s11301-019-00158-5
G2. (2024). Best Competitive Intelligence Tools in 2024. https://www.g2.com/categories/competitive-intelligence
G2. (2024). Best Enterprise Competitive Intelligence Tools in 2024. https://www.g2.com/categories/competitive-intelligence/enterprise
G2. (2024). Best Competitive Intelligence Tools for Small Business. https://www.g2.com/categories/competitive-intelligence/small-business
G2. (2024). Best Competitive Intelligence Tools for Medium-Sized Businesses. https://www.g2.com/categories/competitive-intelligence/mid-market
Zapier. (2024). The 9 best competitor analysis tools in 2024. https://zapier.com/blog/competitor-analysis-tools/
Evalueserve. (n.d.). Competitive Intelligence Solutions: 5 Ways to Ease Your Job. https://www.evalueserve.com/blog/competitive-intelligence-solutions-easier-job/
Content Boomerang. (n.d.). Competitive Intelligence Tools: Key Solutions for Strategic Business. https://contentboomerang.com/blog/competitive-intelligence-tools/
Semrush. (n.d.). The 14 Best Competitive Intelligence Tools for Market Research. https://www.semrush.com/blog/best-competitive-intelligence-tools/

1. Quantitative attempt: We first tried to answer this question with the same strategy used for the previous questions: defining keywords, filtering relevant tweets, and aggregating the data. In this case we filtered tweets by ethics-related keywords and by the names of competitive intelligence tools. We then intended to compute the median of the sentiment variable, in order to analyze how the ethics of these tools is perceived.

2. Missing data: Competitive intelligence tools mostly rely on simple machine learning algorithms. None of the 22 tools we found exploited the potential of generative AI.

3. Role of NLU: Research is under way in the emerging field of generative AI tools for strategic and competitive intelligence, but it mostly focuses on using these powerful models for natural language understanding (NLU), in order to process huge quantities of text data with greater understanding.

4. Ethical implications: The ethical implications in this field therefore mostly concern issues arising from web scraping practices, such as illegal access and use of data, breach of contract, copyright, trespass to chattels, and trade secrets.

Sources

De Los Reyes, D., Trajano, D., Manssour, I., Vieira, R., & Bordini, R. (2021). Entity Relation Extraction from News Articles in Portuguese for Competitive Intelligence Based on BERT. In Advances in Artificial Intelligence (pp. 431-442). Springer. https://doi.org/10.1007/978-3-030-91699-2_31
Krotov, V., Johnson, L., & Silva, L. (2020). Tutorial: Legality and Ethics of Web Scraping. Communications of the Association for Information Systems, 47, 24. https://doi.org/10.17705/1CAIS.04724

The project spanned large-scale data collection, processing, and visualization across multiple platforms and data sources.

Data gathering utilized multiple APIs and web scraping techniques. The Twitter analysis leveraged a pre-existing ChatGPT Tweets Dataset containing 3.8 million records with rich metadata including subjects, objects, topics, and job profiles. For Dev.to, the official API was used with custom pagination logic to retrieve articles matching specific tags and search queries, collecting data across 10 pages with 1,000 articles per page. Reddit data collection employed the PRAW library to search the technology subreddit, retrieving posts and up to 1,000 top comments per post ranked by score. Stack Overflow survey data was obtained through web crawling of official survey result pages.
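The Dev.to pagination described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: `collect_articles`, `fetch_page`, and `fake_fetch` are hypothetical names, and the network call is injected as a function so the loop can be shown without hitting the API. In practice `fetch_page` would issue a GET request to Dev.to's articles endpoint with `tag`, `page`, and `per_page` query parameters.

```python
from typing import Callable, Dict, List

Article = Dict[str, object]


def collect_articles(
    fetch_page: Callable[[int, int], List[Article]],
    max_pages: int = 10,
    per_page: int = 1000,
) -> List[Article]:
    """Walk pages 1..max_pages, stopping early when a page comes back empty."""
    seen_ids = set()
    articles: List[Article] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not batch:
            break  # the API has no more results for this query
        for article in batch:
            # De-duplicate by id in case consecutive pages overlap.
            if article["id"] not in seen_ids:
                seen_ids.add(article["id"])
                articles.append(article)
    return articles


def fake_fetch(page: int, per_page: int) -> List[Article]:
    """Hypothetical stand-in for the HTTP call: serves 25 articles in total."""
    start = (page - 1) * per_page
    stop = min(start + per_page, 25)
    return [{"id": i, "title": f"post {i}"} for i in range(start, stop)]


if __name__ == "__main__":
    print(len(collect_articles(fake_fetch, max_pages=10, per_page=10)))
```

Injecting the fetch function keeps the paging logic separate from the HTTP client, which also makes it easy to add the client-side filtering the API required.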

The NLP pipeline integrated multiple libraries for different analysis tasks. NLTK provided tokenization, stopword removal, and part-of-speech tagging for extracting noun phrases and identifying technical terminology. spaCy's models enabled named entity recognition to identify companies, technologies, and proper nouns within text. Custom filtering logic combined regex patterns with predefined keyword lists to isolate coding-related content from the broader dataset. For sentiment analysis, the project leveraged spaCy's linguistic features to identify adjectives and sentiment-bearing terms.

Given the massive scale of the Twitter dataset (3.8M rows), efficient filtering was critical. The implementation used pandas for vectorized operations, filtering the dataset by creating lowercase column versions and applying regex patterns for 100+ coding-related keywords. This reduced the dataset to approximately 1.47 million coding-relevant tweets. For company name extraction, the process involved multiple stages: initial automated extraction, frequency counting, filtering by minimum thresholds, and manual validation through an interactive Python script that prompted review of 200 companies.
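The keyword filter can be illustrated with a single compiled pattern, as in the sketch below. The keyword list is a hypothetical subset of the 100+ terms, and the function names are assumptions made for illustration. The example uses only the standard library; with pandas, the same pattern string could be applied to a whole column through `Series.str.contains(..., case=False, regex=True, na=False)`.

```python
import re
from typing import Iterable, List

# Hypothetical subset of the 100+ coding-related keywords.
CODING_KEYWORDS = ["python", "javascript", "github", "debugging", "api", "c++"]


def build_keyword_pattern(keywords: Iterable[str]) -> "re.Pattern":
    """Combine all keywords into one case-insensitive alternation."""
    # re.escape makes characters such as "+" match literally.
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    # The lookarounds stop "api" from matching inside "capital".
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def filter_coding_texts(texts: Iterable[str], pattern: "re.Pattern") -> List[str]:
    """Keep only the texts that mention at least one keyword."""
    return [text for text in texts if pattern.search(text)]


if __name__ == "__main__":
    tweets = [
        "ChatGPT wrote my Python script",
        "Capital markets rallied today",
        "Learning C++ with an AI tutor",
    ]
    print(filter_coding_texts(tweets, build_keyword_pattern(CODING_KEYWORDS)))
```

Compiling one alternation means each text is scanned once, rather than once per keyword.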

You can explore all the Jupyter notebooks used in this phase at this link

The R Shiny dashboard serves as the primary interface for exploring findings. Built with a modular structure (ui.R, server.R, app.R), the dashboard implements reactive programming to handle user interactions efficiently. Visualization libraries include plotly for interactive charts, wordcloud2 for frequency visualizations, ggplot2 for statistical graphics, and treemap for hierarchical data representation. The dashboard loads data reactively, rendering visualizations only when needed to optimize performance.

The project demonstrates effective integration of Python for data processing and R for visualization. Python notebooks handled ETL operations, API interactions, and NLP tasks, exporting cleaned datasets as CSV files. R Shiny consumes these outputs to create the final analytical interface. This multi-language approach used the strengths of each ecosystem: Python's extensive API and NLP libraries for data gathering, and R's visualization capabilities for presentation.

The project encountered numerous technical and methodological challenges that required creative problem-solving and adaptive strategies.

Working with 3.8 million tweet records strained computational resources and processing time. The initial naive approach of iterating through rows proved impractical, taking hours to process. The solution involved using pandas' vectorized operations, creating lowercase column versions for case-insensitive matching, and applying regex patterns across entire columns simultaneously. This optimization reduced processing time from hours to minutes. For company name extraction, implementing incremental processing and saving intermediate results protected against data loss from crashes or interruptions.
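One way to implement the incremental processing with saved intermediate results is sketched below. The chunk size, the JSON checkpoint layout, and the names `count_with_checkpoints` and `extract` are assumptions for illustration; the source does not specify how the intermediate results were stored.

```python
import json
import os
import tempfile
from collections import Counter
from typing import Callable, Iterable, List


def count_with_checkpoints(
    texts: List[str],
    extract: Callable[[str], Iterable[str]],
    checkpoint_path: str,
    chunk_size: int = 1000,
) -> Counter:
    """Count extracted names chunk by chunk, saving progress after each chunk."""
    counts: Counter = Counter()
    start = 0
    if os.path.exists(checkpoint_path):
        # Resume from the last completed chunk instead of starting over.
        with open(checkpoint_path, encoding="utf-8") as fh:
            state = json.load(fh)
        counts.update(state["counts"])
        start = state["next_index"]

    for begin in range(start, len(texts), chunk_size):
        for text in texts[begin:begin + chunk_size]:
            counts.update(extract(text))
        # Write to a temporary file and swap it in, so a crash mid-write
        # cannot corrupt the previous checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump({"next_index": begin + chunk_size, "counts": counts}, fh)
        os.replace(tmp_path, checkpoint_path)
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "companies.json")
        print(count_with_checkpoints(["google openai", "openai"], str.split, path, 1))
```

Here `str.split` stands in for the real extraction step (for example, an NER pass that returns organization names).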

Each API imposed different constraints on data access. Dev.to's beta API had limited filtering accuracy despite accepting tag and query parameters, requiring additional client-side filtering. Reddit's API limited comment retrieval, necessitating prioritization strategies to capture top comments by score. The solution implemented pagination with appropriate delays between requests, retry logic for failed requests, and batch processing to respect rate limits while maximizing data collection efficiency.
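The retry logic could look something like the following sketch. The exponential backoff schedule, the default exception types, and the name `with_retries` are my assumptions; the source only states that delays and retries were used, not how they were scheduled.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    request: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the schedule can be checked without actually waiting; a fixed delay between successful page requests would be added separately in the calling loop.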

Off-the-shelf NER models produced significant false positives when identifying company names. Common English words like "next," "test," "nothing," "yes," and "current" were incorrectly classified as organizations. Conversely, some actual companies like Binance and Nothing (the actual tech company) required special handling. The solution implemented a multi-stage validation process: automated extraction with frequency thresholds, manual review through an interactive script, and frequency adjustment factors for ambiguous terms. Words that were both companies and common terms received reduced frequency weights (divided by 50 instead of 300) to account for probable legitimate mentions.
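The frequency adjustment can be illustrated as below. This is one reading of the description, stated as an assumption: raw counts of common-word false positives are divided by 300, while terms that are both common words and real companies are divided by 50, and a minimum threshold is applied afterwards. The word lists, counts, and function names are hypothetical.

```python
from typing import Dict, Set

COMMON_WORD_DIVISOR = 300  # assumed penalty for plain common words ("next", "test")
AMBIGUOUS_DIVISOR = 50     # assumed lighter penalty for words that are also companies


def adjust_counts(
    raw_counts: Dict[str, int],
    common_words: Set[str],
    ambiguous_companies: Set[str],
) -> Dict[str, float]:
    """Scale down counts that are likely inflated by ordinary English usage."""
    adjusted: Dict[str, float] = {}
    for name, count in raw_counts.items():
        key = name.lower()
        if key in ambiguous_companies:
            adjusted[name] = count / AMBIGUOUS_DIVISOR
        elif key in common_words:
            adjusted[name] = count / COMMON_WORD_DIVISOR
        else:
            adjusted[name] = float(count)
    return adjusted


def keep_frequent(adjusted: Dict[str, float], min_count: float) -> Dict[str, float]:
    """Drop candidates whose adjusted count falls below the threshold."""
    return {name: c for name, c in adjusted.items() if c >= min_count}


if __name__ == "__main__":
    raw = {"Google": 900, "next": 3000, "Nothing": 3000}  # made-up counts
    scores = adjust_counts(raw, {"next", "test", "yes", "current"}, {"nothing"})
    print(keep_frequent(scores, min_count=50))
```

Candidates that survive the threshold would still go through the manual review step, since the divisors only approximate how often an ambiguous word really refers to a company.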

Different platforms structured data differently, complicating comparative analysis. Twitter provided subjects, objects, and predicates; Dev.to offered tags and descriptions; Reddit had titles and comments; Stack Overflow used survey questions. The solution standardized data through platform-specific preprocessing pipelines that extracted comparable features: textual content, metadata categories, temporal information, and frequency counts. Creating normalized CSV exports enabled consistent analysis in the R Shiny dashboard regardless of source platform.
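A platform-specific mapping onto a shared schema might be sketched as follows. The column names in `FIELD_MAP` (for example `subject`, `published_at`, `created_utc`) and the four-column target schema are assumptions chosen for illustration, not the project's actual export layout.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["platform", "text", "category", "created_at"]

# Hypothetical mapping from each platform's field names to the shared schema.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "twitter": {"text": "text", "category": "subject", "created_at": "date"},
    "devto": {"text": "description", "category": "tags", "created_at": "published_at"},
    "reddit": {"text": "title", "category": "subreddit", "created_at": "created_utc"},
}


def normalize(platform: str, record: Dict[str, object]) -> Dict[str, str]:
    """Project one platform-specific record onto the shared columns."""
    row = {"platform": platform}
    for target, source in FIELD_MAP[platform].items():
        row[target] = str(record.get(source, ""))  # missing fields become ""
    return row


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize normalized rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = [
        normalize("twitter", {"text": "ChatGPT writes Python", "subject": "ChatGPT", "date": "2023-01-05"}),
        normalize("devto", {"description": "Intro to LLMs", "tags": "ai", "published_at": "2023-02-01"}),
    ]
    print(to_csv(rows))
```

Keeping the mapping in a table means adding a platform is a data change rather than a new code path, and the dashboard only ever sees one column layout.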

As an additional contribution to the Strategic and Competitive Intelligence project, I developed an R-based tool for extracting structured data from Stack Overflow. This custom scraper supported the data gathering phase.

The scraper implements a modular architecture with clear separation of concerns across seven R scripts. The main.r file orchestrates the entire workflow, coordinating query formulation, link collection, data extraction, and output generation. The classes.r module defines R6 object-oriented structures for Questions, Answers, and Comments, enabling hierarchical data modeling that preserves the natural structure of Stack Overflow content. The queryOptions.r system handles advanced query formulation, supporting all Stack Overflow search operators including tags, score thresholds, view counts, and code snippets.

The tool supports filtering through Stack Overflow's search syntax, enabling users to construct precise queries. Users can filter by tags like python or opencv while excluding others, specify minimum scores to focus on high-quality questions, require minimum answer counts to ensure solutions exist, and sort by view counts to identify widely-encountered problems. The system can even search for specific code patterns, enabling identification of how particular APIs or functions are being used in practice. Support for closed/duplicate filtering helps focus on active, unique questions. The query builder works both interactively through prompts and programmatically through parameter passing, providing flexibility for different use cases.
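The query formulation can be illustrated with a small builder. The actual tool is written in R; this Python sketch only mirrors the idea, and `build_query` and its parameters are hypothetical names. The operator spellings (`[tag]`, `-[tag]`, `score:`, `answers:`, `views:`, `code:"..."`, `closed:`, `duplicate:`) follow Stack Overflow's search syntax as I understand it and should be checked against the site's search help.

```python
from typing import Iterable, List, Optional


def build_query(
    include_tags: Iterable[str] = (),
    exclude_tags: Iterable[str] = (),
    min_score: Optional[int] = None,
    min_answers: Optional[int] = None,
    min_views: Optional[int] = None,
    code: Optional[str] = None,
    closed: Optional[bool] = None,
    duplicate: Optional[bool] = None,
) -> str:
    """Assemble a Stack Overflow search string from structured options."""
    parts: List[str] = [f"[{tag}]" for tag in include_tags]
    parts += [f"-[{tag}]" for tag in exclude_tags]  # leading "-" excludes a tag
    if min_score is not None:
        parts.append(f"score:{min_score}")
    if min_answers is not None:
        parts.append(f"answers:{min_answers}")
    if min_views is not None:
        parts.append(f"views:{min_views}")
    if code:
        parts.append(f'code:"{code}"')  # search inside code blocks
    if closed is not None:
        parts.append(f"closed:{'yes' if closed else 'no'}")
    if duplicate is not None:
        parts.append(f"duplicate:{'yes' if duplicate else 'no'}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(["python", "opencv"], ["tensorflow"], min_score=5,
                      min_answers=1, code="cv2.imread", closed=False, duplicate=False))
```

Because every option defaults to "not set", the same function serves both the interactive prompts and programmatic calls: each path simply fills in the arguments it has.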

The scraping engine leverages the rvest package for robust HTML parsing with XPath and CSS selectors. The getAllQuestionsLinks.r module implements intelligent pagination, automatically requesting additional pages until the desired number of results is collected, with each page containing up to 50 questions. Captcha detection and handling is built-in, with the system pausing when it detects a challenge and prompting the user to solve it before continuing. The extractDataFromQuestionPage.r module performs deep extraction from individual question pages, parsing the hierarchical structure of questions, their multiple answers, and nested comments.

For each question, the scraper extracts the full title, vote count indicating community value, complete question text preserving formatting, all code snippets separately identified, and total answer count. For each answer, it captures the unique answer ID for tracking, complete answer text with formatting, all code blocks within the answer, vote count showing community validation, and the full comment thread with user attribution.
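The question, answer, and comment hierarchy can be pictured with the following sketch. The scraper itself uses R6 classes in R; this is a Python analogue using dataclasses, and the attribute and method names are assumptions based on the fields listed above rather than the tool's real definitions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Comment:
    user: str
    text: str


@dataclass
class Answer:
    answer_id: str
    text: str
    votes: int = 0
    code_snippets: List[str] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)


@dataclass
class Question:
    title: str
    text: str
    votes: int = 0
    code_snippets: List[str] = field(default_factory=list)
    answers: List[Answer] = field(default_factory=list)

    def summary_row(self) -> Dict[str, object]:
        """Flatten the hierarchy into aggregate metrics for tabular output."""
        return {
            "title": self.title,
            "votes": self.votes,
            "answer_count": len(self.answers),
            "total_answer_votes": sum(a.votes for a in self.answers),
            "comment_count": sum(len(a.comments) for a in self.answers),
        }


if __name__ == "__main__":
    question = Question(
        title="How do I read an image?",
        text="I tried cv2.imread but got None.",
        votes=12,
        answers=[
            Answer("a1", "Check the path.", votes=7, comments=[Comment("ana", "Worked!")]),
            Answer("a2", "Use an absolute path.", votes=3),
        ],
    )
    print(question.summary_row())
```

Keeping comments nested inside answers preserves the page structure for qualitative work, while `summary_row` shows how the same objects can feed a flat, quantitative export.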

The tool generates outputs optimized for different analysis scenarios.

CSV format exports tabular data including question titles, votes, answer statistics, and aggregate metrics like total votes across all answers and comment counts. This format is ideal for quantitative analysis, statistical processing, and dashboard integration.

YAML format preserves the complete hierarchical structure including full text content of questions, answers, and comments, code snippets with proper formatting, and nested comment threads. This format is perfect for qualitative analysis, content mining, and detailed case studies.

JSON format stores the collected question links, enabling checkpointing and reproducibility of data collection sessions.