Python 1

I mainly use Python for web crawling, data preprocessing, and variable construction. Here are some examples of programming projects I’ve completed:

  1. I developed a web crawler using the Selenium package (which automates web browser interaction from Python) to collect patent data from the Derwent Innovations Index (DII) database, including detailed information on the forward and backward citations of patent applications from relevant companies. The purpose of this project was to quantify disruptive innovation, following the measure adopted by Wu et al. (2019) in Nature (Link: https://www.nature.com/articles/s41586-022-05543-x). The underlying mechanism is that if a paper or patent is disruptive, the subsequent work that cites it is less likely to also cite its predecessors; for future researchers, the ideas that went into its production are less relevant. If a paper or patent is consolidating, subsequent work that cites it is more likely to also cite its predecessors; for future researchers, the knowledge upon which the work builds is still (and perhaps more) relevant (for example, the theorems Kohn and Sham used). The CD index ranges from -1 (consolidating) to 1 (disruptive).
  1. I developed a Python program that applies the Word2Vec machine learning algorithm, a natural language processing technique more advanced than the word frequency method used by Merkley (2014) (Link: https://publications.aaahq.org/accounting-review/article-abstract/89/2/725/3609/Narrative-Disclosure-and-Earnings-Performance?redirectedFrom=fulltext), to expand an initial word set so that it is applicable to annual reports. The Word2Vec neural network model represents words as multi-dimensional vectors according to their context, measures the similarity in meaning between words by calculating the similarity of their word vectors, and finally screens synonyms based on the level of similarity. The purpose of this project was to use an initial word set and an extended word set trained with Word2Vec to measure business digitalization from the annual financial reports of Chinese listed firms. Because a proper selection of synonyms is critical to the accuracy of narrative innovation indicators, finding a ‘synonym dictionary’ applicable to a financial corpus such as annual reports is one of the significant parts of this paper.
  1. I developed a Python program for sentiment analysis, specifically designed to analyze the tone of corporate social responsibility (CSR) reports. The program uses HanLP’s pre-trained Chinese model for tokenization and dependency parsing, and loads positive and negative word lists from a financial sentiment dictionary. For each report, the program reads the text, removes spaces and newline characters, and splits the content into sentences using periods as delimiters. For each sentence, the program tokenizes it using HanLP and performs dependency parsing to detect negation structures. It then counts the total number of words, positive words, and negative words based on the sentiment dictionary, taking negation into account (i.e., a positive word in a negated context, or a negated negative word, has its polarity reversed). The final results, including total word count, positive word count, and negative word count, are written to a CSV file, with each row containing the report’s ID, name, and sentiment statistics.
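
For the patent project, the CD index logic described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project. It assumes the common three-count form of the index: (later works citing only the focal patent, minus later works citing both the focal patent and its predecessors) divided by all later works citing the focal patent or any of its predecessors. The function name `cd_index`, its arguments, and the toy patent IDs are hypothetical.

```python
from typing import Dict, Optional, Set


def cd_index(
    focal: str,
    predecessors: Set[str],
    later_refs: Dict[str, Set[str]],
) -> Optional[float]:
    """Return a CD-style index in [-1, 1], or None if no later work is relevant.

    focal        : ID of the focal patent (hypothetical ID scheme)
    predecessors : IDs cited by the focal patent (its backward citations)
    later_refs   : maps each later patent ID to the set of IDs it cites
    """
    n_focal_only = 0  # cite the focal patent but none of its predecessors
    n_both = 0        # cite the focal patent and at least one predecessor
    n_pred_only = 0   # cite a predecessor but not the focal patent

    for refs in later_refs.values():
        cites_focal = focal in refs
        cites_pred = bool(refs & predecessors)
        if cites_focal and cites_pred:
            n_both += 1
        elif cites_focal:
            n_focal_only += 1
        elif cites_pred:
            n_pred_only += 1
        # Later patents citing neither are ignored.

    total = n_focal_only + n_both + n_pred_only
    if total == 0:
        return None
    return (n_focal_only - n_both) / total


# Toy citation data (made up for illustration only).
example_refs = {
    "P1": {"F"},
    "P2": {"F"},
    "P3": {"F", "A"},
    "P4": {"B"},
    "P5": {"Z"},
}
score = cd_index("F", {"A", "B"}, example_refs)
print(score)
```

A value near 1 means later patents mostly cite the focal patent without its predecessors (disruptive), while a value near -1 means they mostly cite both (consolidating).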
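
For the Word2Vec project, the synonym-screening step (comparing word vectors and keeping the closest words) might look something like the sketch below. It covers only the screening step, not model training: in practice the vectors would come from a Word2Vec model trained on the annual-report corpus, whereas here they are tiny hand-written toy vectors. The function names, the English toy words, and the similarity threshold are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_seed_words(
    seeds: List[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return non-seed words whose similarity to any seed is >= threshold."""
    candidates: Dict[str, float] = {}
    for seed in seeds:
        if seed not in vectors:
            continue  # seed word not in the vocabulary
        for word, vec in vectors.items():
            if word in seeds:
                continue
            sim = cosine(vectors[seed], vec)
            if sim >= threshold:
                # Keep the best similarity across all seed words.
                candidates[word] = max(sim, candidates.get(word, 0.0))
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)


# Toy 3-dimensional vectors (hypothetical; real models use far more dimensions).
toy_vectors = {
    "digital": [1.0, 0.0, 0.0],
    "digitalization": [0.9, 0.1, 0.0],
    "cloud": [0.8, 0.6, 0.0],
    "dividend": [0.0, 0.0, 1.0],
}
expanded = expand_seed_words(["digital"], toy_vectors, threshold=0.85)
print(expanded)
```

The threshold controls how strict the screening is. A candidate list produced this way would normally still be reviewed by hand before being added to the dictionary.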
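
For the CSR sentiment project, the counting logic with polarity reversal can be illustrated with the simplified sketch below. It is not the original program. It takes already-tokenized sentences instead of calling HanLP, and it replaces dependency parsing with a much cruder rule (a negation word within the two preceding tokens flips polarity). The negator list, the word lists, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical negation words; a real list would be longer.
NEGATORS = {"不", "没有", "未"}


def split_sentences(text: str) -> List[str]:
    """Remove spaces and newlines, then split on the Chinese full stop."""
    cleaned = text.replace(" ", "").replace("\n", "")
    return [s for s in cleaned.split("。") if s]


def count_sentiment(
    tokens: List[str],
    positive: Set[str],
    negative: Set[str],
    window: int = 2,
) -> Tuple[int, int, int]:
    """Return (total words, positive count, negative count) for one sentence.

    A sentiment word preceded by a negator within `window` tokens has its
    polarity flipped. This window rule is a stand-in for dependency parsing.
    """
    pos = neg = 0
    for i, tok in enumerate(tokens):
        if tok not in positive and tok not in negative:
            continue
        negated = any(t in NEGATORS for t in tokens[max(0, i - window):i])
        # Positive if (positive and not negated) or (negative and negated).
        if (tok in positive) != negated:
            pos += 1
        else:
            neg += 1
    return len(tokens), pos, neg


# Toy dictionary and pre-tokenized sentences (illustrative only).
positive_words = {"增长"}
negative_words = {"违规"}

result_a = count_sentiment(["公司", "没有", "违规", "业绩", "增长"],
                           positive_words, negative_words)
result_b = count_sentiment(["利润", "不", "增长"],
                           positive_words, negative_words)
print(result_a, result_b)
```

A window rule like this can misfire when the negator belongs to a different clause, which is one reason dependency parsing is the more reliable way to detect negation.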