Pandas Cheat Sheet

1. Collect a dataframe as a dictionary:
    df.set_index(['a', 'b']).T.to_dict('list')
2. Read in a CSV file (and transpose it); see the read_csv docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
3. Count rows with the same value:
    df[col].value_counts()
4. Display all columns without wrapping (similar to Spark's df.show(truncate=False)):
    pd.set_option('display.expand_frame_repr', False)
   Other settings:
    pd.set_option('display.height', 1000)  (old pandas releases only; later versions dropped this option)
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
5. Keep only the rows whose value occurs more than n times (that is, drop values that occur n times or fewer):
    df[df.groupby(value).uid.transform(len) > n]
   or:
    df.groupby(by=value).filter(lambda x: len(x) > n) »
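As a quick sanity check of tips 1, 3 and 5, here is a minimal sketch on toy data. The frames `df` and `small`, their column contents, and the threshold `n = 1` are made up for illustration; only the column names `uid`, `value`, `a`, `b` echo the snippets above.

```python
import pandas as pd

# Toy data (assumed for illustration): "x" occurs 3 times, "y" twice, "z" once.
df = pd.DataFrame({
    "uid": [1, 2, 3, 4, 5, 6],
    "value": ["x", "x", "x", "y", "y", "z"],
})
n = 1  # hypothetical threshold

# Tip 3: number of rows per distinct value.
counts = df["value"].value_counts()

# Tip 5, variant A: broadcast each group's size back onto its rows, then mask.
kept_transform = df[df.groupby("value")["uid"].transform(len) > n]

# Tip 5, variant B: keep whole groups that have more than n rows.
kept_filter = df.groupby("value").filter(lambda g: len(g) > n)

# Tip 1: index by (a, b), transpose, and collect each column as a list.
small = pd.DataFrame({"a": [1, 2], "b": ["p", "q"], "c": [10, 20], "d": [30, 40]})
as_dict = small.set_index(["a", "b"]).T.to_dict("list")

print(counts)
print(kept_transform)
print(as_dict)
```

Both variants of tip 5 keep the same rows here (the single "z" row is dropped). The dictionary from tip 1 is keyed by the (a, b) tuples, with the remaining columns collected into a list per key.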

Author image J

Performance

Table of Contents:

* [Edge Computing]()
* [Content Delivery Network]()
* Server Side
  * Gateway: nginx, OpenResty
    * long-lived (keep-alive) connections
    * optimized, cheap, fast
    * reverse proxy
    * load balancer
* [Application Side]()
  * [Algorithms]()
  * Disk I/O
    * Async
  * Disk Performance
    * SSD
    * RAID 0, 1, 5, 10
  * Network I/O
    * compress (zip) messages
    * use a better network card & adapter
* [Database Side]()
  * In-memory cache: Redis/memcached »


DistributedComputing Notes

Mostly Hadoop and Spark, as well as general concepts. Think at scale!

Spark

1. Code Optimization
   1. Avoid creating duplicate RDDs.
   2. Avoid recalculation: reuse the same RDD as much as you can. Use cache() to keep an RDD in memory so it is not recomputed, or use persist() to set the cache level manually, e.g. StorageLevel.MEMORY_AND_DISK_SER. Storage levels:
      * MEMORY_ONLY
      * MEMORY_AND_DISK
      * MEMORY_ONLY_SER
      * MEMORY_AND_DISK_SER: cache to memory first; if memory is not enough, write to disk »


Java Notes

Java notes and cheat sheet I wrote during work and academia.

1. ArrayList
   ArrayList is backed by a plain array. Features:
   ArrayList<String> myArrList = new ArrayList<>();
   1. Type safety: with generics, ArrayList is type-checked at compile time, which is a stronger guarantee than an array gives. https://coderanch.com/t/625190/certification/Array-ArrayList-Thread-safety-Type
   2. Flexibility: ArrayList resizes dynamically and has a richer interface.
   3. size() vs length: size() is the number of elements currently in an ArrayList; length is the fixed capacity of an array.
   4. Multi-dimension: ArrayList has no built-in multi-dimensional form (you nest lists instead); arrays do.
   5. Primitive types: ArrayList cannot hold primitive types directly; use wrapper classes such as Integer. »


Scala Notes

I wrote more than 5000 lines of Scala code about a year ago, but I haven't had a chance to write any since. (My university didn't teach it, and my previous internships didn't use it.) Now I'm picking it up again because the university teaches Spark in Scala, but I feel I have forgotten not only a lot of syntax but also some key knowledge about Scala. So I want to write some Scala notes. »


PythonNotes

Cheat notes for PySpark, Pandas, NumPy, NLTK, Keras, TensorFlow, etc. »
